Abstract


Recent advances in computer technology have made parallel machines a reality.
Massively parallel systems use many general-purpose, inexpensive processing elements
to attain computation speed-ups comparable to or better than those achieved by
expensive, specialized machines with a small number of fast processors. In such a setting,
however, one would expect to see an increased number of processor failures attributable
to hardware or software. This may eliminate the potential advantage of parallel
computation. We believe that this presents a reliability bottleneck that is among the
fundamental problems in parallel computation.

We  investigate  algorithmic  ways  of  introducing  fault-tolerance  in  multiprocessors 
under  the  constraint  of  preserving  efficiency.  This  research  demonstrates  how  in  certain 
models  of  parallel  computation  it  is  possible  to  combine  efficiency  and  fault-tolerance. 
We show that in the models we study, it is possible to develop efficient parallel algorithms
without concern for fault-tolerance, and then correctly and efficiently execute these
algorithms on parallel machines whose processors are subject to arbitrary dynamic
fail-stop errors. By ensuring efficient executions for any patterns of failures, the efficiency is
also maintained when failures are infrequent, or when the expected number of failures
is small.

The efficient algorithmic approaches to multiprocessor fault-tolerance presented in
this thesis make a contribution towards bridging the gap between the abstract models
of parallel computation and realizable parallel architectures.

Chapter 1

Introduction 


1.1  Overview  and  Motivation 

Massively parallel machines and networks consisting of hundreds and thousands
of processors are in existence today, and multiprocessor technology is
continuing to evolve. In order to take advantage of the processing power of these parallel
computing environments, there is a corresponding need for efficient algorithms.
Significant research in the past decade was dedicated to the development of efficient parallel
algorithms that assume perfectly reliable parallel computers. However, the algorithms
that are to be executed on realizable parallel machines must be able to deal with
unpredictable system failures. In this dissertation we address efficient algorithmic approaches
to multiprocessor fault-tolerance.

In  parallel  systems,  as  the  number  of  inexpensive  processing  units  grows,  one  would 
expect  to  see  an  increased  number  of  failures  per  individual  processing  element.  The 
hardware  failures  may  be  caused  by  fabrication  defects  or  by  intermittent  errors.  It 
would  be  too  costly  to  make  each  parallel  processing  element  as  reliable  in  hardware  as 
a  single  processor  machine.  When  the  number  of  processors  is  in  the  thousands,  it  is 
likewise  impractical  to  provide  fault  tolerance  and  fault  masking  at  the  level  that  can 
be  achieved  by  the  more  expensive,  specialized  machines  with  a  small  number  of  fast 
processors  such  as  Tandem  [17],  Stratus  [101]  or  VAXft  [26].  In  addition,  the  software 
for  these  systems  is  typically  more  complex  and  thus  less  reliable  than  the  software  of 
the  more  conventional  uniprocessors.  Therefore,  it  is  critical  that  fault-tolerant  versions 
of existing or new algorithms be developed, which preserve efficiency under adverse
conditions. 

In  order  for  practical  multiprocessor  software  to  incorporate  specific  fault-tolerant 
algorithms  and  utilize  particular  techniques  in  assuring  fault  tolerance,  these  algorithms 
and  techniques  must  meet  the  following  criteria: 

Efficiency  :  algorithms  must  use  well  defined  and  bounded  computation  and  network 
resources.  Most  of  the  theoretical  work  to  date  has  concentrated  on  this  goal. 

Scalability : an algorithm should exhibit stable and predictable performance within
parallel environments of increasing sizes. The traditional asymptotic analysis
guarantees this goal only to the extent that reliability and important implementation
details are modeled by the theory (see next two points).

Reliability  :  an  algorithm  must  remain  operational  in  a  parallel  environment  subject 
to  failures.  Reliability,  efficiency  and  scalability  are  the  main  subject  of  this  thesis. 

Feasibility  :  an  algorithm  must  be  implementable  using  state  of  the  art  network  and 
computing  technology,  so  that  it  takes  advantage  of  the  considerable  computing 
resources  available.  Model  and  algorithm  simplicity  are  keys  to  being  able  to 
integrate  abstract  solutions  with  realizable  hardware  and  software  systems. 

The  efficiency  and  scalability  of  parallel  algorithms  have  been  the  subject  of  research 
since the seventies. A model of parallel computation known as the Parallel Random
Access Machine or PRAM [44] has attracted much attention, and many “efficient” and
“optimal”  algorithms  have  been  designed  for  it  (the  surveys  [40,  58]  contain  a  wealth  of 
information  on  the  subject).  The  PRAM  is  a  convenient  abstraction  that  combines  the 
power  of  parallelism  with  the  simplicity  of  a  RAM  (Random  Access  Machine)  [39],  but 
it  has  several  unrealistic  features.  The  PRAM  has  the  following  requirements: 

1. Simultaneous access across a significant bandwidth to a shared resource, memory;

2.  Global  processor  synchronization;  and 

3.  Perfectly  reliable  processors,  memory  and  interconnection  between  them. 

The gap between the abstract models of parallel computation and realizable parallel
computers is being bridged by current research. For example, memory access simulation
in other architectures is the subject of a large body of literature surveyed in [98]; for
some recent work see [49, 87, 97]. Computations on asynchronous PRAMs are the subject
of [29, 31, 45, 75, 78]. The reliability of semiconductor memories has been thoroughly
studied, and a survey can be found in [89], while the theory of error detecting and
correcting codes is reviewed in [76]. The fault-tolerance issues of the interconnection
networks used to integrate processors and memory modules are discussed in [2]. Fault
tolerance of systolic arrays, a particular class of parallel machines, has been studied
to some extent, and some of the achievements in that area are surveyed in [1]. All these
areas are extremely important for dependable massively parallel computing. In this
work we address the following issues: the reliability and synchronization of parallel
processors that can be modeled by PRAMs, and the efficiency of computation on such
processors.

The  model  of  parallel  computation  that  serves  as  the  basis  for  this  work  is  the 
synchronous  PRAM  of  Fortune  and  Wyllie  [44],  with  concurrent  reads  and  concurrent 
writes (CRCW). The convention for determining which processor or processors succeed,
when concurrently writing to shared memory, is immaterial in our algorithms. We
investigate  fault-prone  PRAMs  whose  processors  exhibit  fail-stop  processor  behavior, 
such  as  that  of  Schlichting  and  Schneider  [90].  The  only  atomicity  requirement  is  the 
atomic  concurrent  writing  of  single  bits.  The  redundancy  provided  by  concurrent  reads 
and  writes  is  essential  to  our  model,  e.g.,  when  the  writes  are  exclusive  we  show  that 
efficiency  and  fault  tolerance  cannot  be  combined. 

This thesis includes and extends the study of fault tolerance that was first formalized
by Kanellakis and Shvartsman in [55]. As was shown there, it is possible to combine
efficiency and fault tolerance in many key PRAM algorithms in the presence of arbitrary
dynamic processor errors when processors fail by stopping and do not perform any
further actions. The key to such algorithm design is the following fundamental problem,
called the Write-All problem:

Given a P-processor PRAM and a 0-valued array
of N elements, write value 1 into all array locations.


This problem was formulated to capture the essence of the computational progress that
can be naturally accomplished in unit time by a PRAM (when P = N). In the absence of
failures, this problem is solved by a trivial and optimal parallel assignment. However,
it is not obvious how to design solutions that are efficient in the presence of failures
or asynchrony. The first algorithm for the Write-All problem with poly-logarithmic
overhead in work was shown in [55].
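
To make the problem statement concrete, the following is a minimal sketch of a
synchronous simulation. It is not the algorithm of [55]: it uses a deliberately naive
strategy in which every live processor sweeps the array cyclically from its own offset.
The function name, the `failures` dictionary used to describe the adversary, and the
global termination test are assumptions made for this illustration only. Work is
counted as the number of steps taken by live processors.

```python
def naive_write_all(n, p, failures):
    """Simulate a naive Write-All on a synchronous fail-stop CRCW PRAM.

    n        -- array size N
    p        -- number of processors P
    failures -- hypothetical adversary: {step: set of processor ids that
                fail-stop at the start of that step}
    Returns (array, work), where work counts steps taken by live processors.
    """
    x = [0] * n
    alive = set(range(p))
    work = 0
    step = 0
    # Simulation convenience: termination is detected globally, for free.
    while not all(x):
        alive -= failures.get(step, set())
        if not alive:
            raise RuntimeError("no surviving processor")
        for pid in alive:
            # Each processor starts at its own offset and sweeps cyclically.
            # Concurrent writes of the same value 1 are harmless.
            x[(pid * n // p + step) % n] = 1
            work += 1
        step += 1
    return x, work
```

With no failures and P = 4, N = 8 this sketch performs 8 units of work; if three of the
four processors stop after the first step, the lone survivor finishes the sweep and the
work grows to 11. In the worst case the work of this naive strategy is on the order of
N times P, which is the kind of overhead that an efficient solution must avoid.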

Using solutions to the Write-All problem, we show that arbitrary PRAM algorithms
can be efficiently and deterministically executed on fail-stop PRAMs, whose processors
are subject either to arbitrary dynamic patterns of failures, or to dynamic patterns of
failures and restarts.

An important part of this research addresses the definition of models for
fault-tolerant parallel computation, based on selections from a spectrum of types of failures
and on the architecture of the system used. Such modeling must necessarily precede
the development of techniques for constructing parallel algorithms.

There is an interesting body of research in distributed algorithms addressing
problems similar to those we consider in this thesis. Some of the high level concerns addressed
by our parallel algorithms are analogous to those encountered by other researchers in
their work on distributed algorithms. Both the parallel and distributed techniques share
the goals of failure detection, load scheduling and progress evaluation. For example, fault
tolerance is the subject of significant current research in the setting of dynamic
asynchronous network protocols. Distributed controllers have been developed for resource
allocation in network protocols, where the total number of messages sent is the resource
monitored [4, 70]. Other developments [3, 13, 15] solve the problems of executing
distributed algorithms in the presence of dynamic network changes, i.e., dynamic changes
of the computation medium.

One  of  the  important  results  of  the  work  in  the  distributed  model  is  that  it  is 
possible  to  take  synchronous  algorithms  or  algorithms  that  are  designed  for  fixed  network 
topologies  and  “compile”  them  so  that  they  can  be  used  with  asynchronous  networks 
or the networks whose topology changes dynamically [15]. Our research begins to yield
similar  results  for  certain  shared  memory  parallel  models. 

The potential reliability advantage of distributed computing systems is due to the
replication of resources. The resulting redundancy in computation is a trade-off of
efficiency (measured in terms of available resources) for fault tolerance. Mullender, in the
1989 Distributed Systems edition of the ACM’s Frontier Series [77], considers
independent failures essential for distributed systems, i.e., a failure of a single processing node
must not lead to the failure of the entire distributed system. He gives the disqualifying
disadvantage of multiprocessors with shared memory [77, page 6]:

What  disqualifies  multiprocessors  is  that  there  is  no  independent  failure: 
when  one  processor  crashes,  the  whole  system  stops  working.  .  .  . 

Although Mullender primarily refers to the problems of the available general
purpose multiprocessor architectures, it is nevertheless true that most efficient parallel
algorithms assume perfect processor reliability, and therefore these algorithms crash or
produce incorrect results even if a single processor failure is encountered. The
underlying theme that we address in this thesis in the context of parallel algorithms is

Combining  the  reliability  potential  of  distributed  computing  with 
the  speed-up  potential  of  parallel  computing. 


1.2  Contributions 

We  now  overview  the  contributions  of  the  research  included  in  this  thesis.  Our  main 
results  have  established  that  it  is  possible  to  combine  efficiency  and  fault-tolerance  for 
two  particular  parallel  models:  no-restart  fail-stop  CRCW  PRAMs  and  restartable  fail- 
stop  CRCW  PRAMs. 

We formalized a model of computation and failures (at the granularity of fail-stop
processing units), and defined a complexity measure for evaluating the algorithms’
efficiency and fault tolerance. The key complexity measure that we define generalizes
the notion of work of parallel algorithms, commonly expressed as the Parallel-time
× Processors product (in the absence of failures). We introduced the Write-All paradigm
and showed that efficient parallel algorithms can be made robust, that is, be efficient
and correct in the presence of arbitrary fail-stop errors without restarts as long as a
single processor remains active. Specifically, the efficiency of parallel algorithms is
degraded by no more than a multiplicative factor that is the square of the logarithm of the
size of the input. In many circumstances we achieve even lower overheads.
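
One way to make such a measure concrete is sketched below. It is an illustration and
not the formal definition given later in the thesis. The symbols are notation assumed
for this sketch only: $\tau$ is the number of parallel steps of an execution, $P_i$ is
the number of processors that are live during step $i$, $W$ is the failure-free work of
the source algorithm, and $c$ is an unspecified constant.

```latex
% Work counted as available processor steps (illustrative notation):
S \;=\; \sum_{i=1}^{\tau} P_i .
% With no failures, P_i = P for every step, which recovers the usual product:
S \;=\; \tau \cdot P .
% Reading the stated overhead as a bound on the robust execution:
S_{\mathrm{robust}} \;\le\; c \cdot W \cdot \log^2 N ,
\qquad \text{e.g. } N = 2^{20} \;\Rightarrow\; \log_2^2 N = 400 .
```
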

We have also shown that the concurrent write model (CRCW) is essential in achieving
robustness. Concurrent writes are the source of redundancy in our approach. Without
concurrent writes, algorithms may incur up to a linear multiplicative overhead, as is the
case with the concurrent read exclusive write (CREW) model.

For  the  no-restart  fail-stop  CRCW  PRAMs  we  developed  a  general  fault-tolerant 
PRAM  algorithm  simulation.  We  showed  how  to  execute  correctly  and  efficiently  any 
PRAM  algorithm  on  a  fault-prone  fail-stop  PRAM.  The  simulation  is  based  on  a  solution 
to  the  Write-All  problem  and  the  techniques  used  in  implementing  the  general  parallel 
assignment.  We  have  also  shown  that  any  parallel  algorithm  can  be  optimally  executed 
in  the  presence  of  arbitrary  fail-stop  errors  using  a  slightly  smaller  number  of  processors 
by taking advantage of parallel slackness as advocated by Valiant in [98]. By optimal
execution we mean that the asymptotic efficiency of the source algorithm is not degraded
even if arbitrary fail-stop errors are encountered. Given an N-processor PRAM algorithm
we simulate it efficiently by a P-processor fail-stop CRCW PRAM algorithm, for $P \le N$.
The simulation is optimal for $P \le N/(\log^2 N - \log N \log\log N)$.
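
As a rough numerical illustration of how much processor slack this condition asks for,
the sketch below evaluates the largest P allowed by the bound, read here as
P at most N/(log^2 N - log N log log N). The function name is hypothetical, and the use
of base-2 logarithms with all constant factors dropped is an assumption; the bound in
the text is asymptotic.

```python
import math


def max_processors_for_optimal_simulation(n):
    """Largest P with P <= N / (log^2 N - log N * log log N).

    Assumes base-2 logarithms and ignores constant factors, so the value
    is indicative only.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    log_n = math.log2(n)
    denom = log_n * log_n - log_n * math.log2(log_n)
    return int(n // denom)
```

For N = 65536 the denominator is 256 - 64 = 192, so under these assumptions roughly 341
simulating processors can be kept optimally busy.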

Extending the fail-stop no-restart model, we formulated a restartable fail-stop CRCW
PRAM and computations in this model. We allow the PRAM processors to be subject
to arbitrary stop failures and restarts that are determined by an on-line adversary. The
failures result in loss of private memory but do not affect shared memory. For this
model, we define and justify the complexity measures of completed work, where
processors are charged for completed fixed-size update cycles, and overhead ratio, which
amortizes the work over necessary work and failures.

We present a simulation strategy for any N-processor PRAM on a restartable
fail-stop P-processor CRCW PRAM such that it guarantees a terminating execution of each
simulated N-processor step, with $O(\log^2 N)$ overhead ratio and
$O(\min\{N + P\log^2 N + M\log N,\; N \cdot P^{0.59}\})$ (sub-quadratic) completed work,
where M is the number of failures during this step’s simulation. This strategy is
work-optimal when the number of simulating processors is
$P \le N/(\log^2 N - \log N \log\log N)$ and the total number of failures
per each simulated N-processor step is $O(N/\log N)$. These results are based on a new
algorithm for the Write-All problem, together with a modification of our main fail-stop
algorithm.
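
The completed-work bound above is a minimum of two expressions, and which one applies
depends on the number of failures M. The sketch below evaluates both, reading the
bound as min{N + P log^2 N + M log N, N * P^0.59}. The function name is hypothetical,
and base-2 logarithms with constants dropped are assumptions, so only the comparison
between the two terms is meaningful.

```python
import math


def restartable_step_work_bound(n, p, m):
    """Evaluate min{N + P log^2 N + M log N, N * P^0.59}, constants dropped.

    Returns (value, which), where which names the smaller expression.
    """
    log_n = math.log2(n)
    first = n + p * log_n ** 2 + m * log_n  # grows with the failures M
    second = n * p ** 0.59                  # independent of M
    if first <= second:
        return first, "first"
    return second, "second"
```

With N = 65536 and P = 341, the first expression is smaller when there are no failures,
while for a very large number of failures and restarts (say M = 10^7) the second,
failure-independent expression takes over and keeps the completed work sub-quadratic.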

We  studied  the  lower  bounds  for  the  no-restart  and  for  the  restartable  fail-stop 
PRAMs. 

We show that there exist no optimal Write-All solutions for the N-processor no-restart
PRAM, even if each processor can perform memory snapshots, i.e., read and locally
process the entire shared memory at unit cost. We show that under this
hypothesis, $\Omega(N \log N/\log\log N)$ work is required. This is the strongest possible
bound under this assumption.

For the restartable PRAM model, we showed that the Write-All problem requires
$\Omega(N \log N)$ completed work when P = N, and this lower bound holds even under the
additional assumption of memory snapshots. Under this assumption we have a matching
upper bound.

Despite  the  memory  snapshot  assumption,  these  lower  bounds  are  of  interest  also 
because  we  use  these  results  to  show  the  lower  and  upper  bounds  for  some  of  the 
algorithms  we  develop  in  this  work.  The  lower  bounds  for  both  models  also  apply  to 
the  expected  work  of  randomized  algorithms. 

We  also  show  that  for  some  fundamental  algorithms,  it  is  possible  to  construct  fault- 
tolerant  algorithms  that  improve  on  the  efficiency  of  naive  general  simulations.  For 
example,  we  show  how  to  use  the  Write-All  technique  to  achieve  savings  in  computing 
parallel  prefixes  for  any  associative  operation,  and  compute  list  ranking. 

Finally, using a deterministic bootstrapping and balancing argument, we show how
to solve the Write-All problem when auxiliary memory is contaminated with arbitrary
values. All previous Write-All solutions use $\Omega(P)$ auxiliary shared memory and assume
that this memory is cleared or initialized to some known value. For any dynamic pattern
of fail-stop, no-restart errors on a CRCW PRAM with at least one surviving processor,
our new algorithm writes all 1’s using $O(N + P\log^3 N/(\log\log N)^2)$ work without any
initialization assumption. This technique can be combined with any Write-All algorithm
to yield efficient simulations of any PRAM and even optimal simulations given processor
slack. It can also be used with restartable fail-stop processor simulations.

1.3  Related  Work 

1.3.1  Fault-tolerant  parallel  computation 

The study of PRAM fault tolerance was initiated by Kanellakis and Shvartsman in [55],
where a new complexity measure for fault tolerant PRAM algorithms was defined, where
the notion of parallel robustness was introduced, and where the Write-All problem was
defined.

The techniques presented in [55] can readily be employed in making arbitrary PRAM
algorithms fault-tolerant. The iterated Write-All paradigm was employed
(independently) by Kedem et al. in [59] and by Shvartsman in [92] to extend the results of [55]
to arbitrary PRAM algorithms (subject to fail-stop errors without restarts). In addition
to the general simulation technique, [59] analyzes the expected behavior of several
solutions to Write-All using a particular random failure model. The algorithms analyzed
included algorithms from [55] and [75], and a new algorithm based on pointer doubling
that has a good expected behavior for the failure model defined. The deterministic
execution of PRAM algorithms in [92] is optimal for any adversary when parallel slackness
(as in [99]) is exploited to our advantage.

Asynchronous versions of the PRAM are a subject of recent research. Various means of
relaxing  the  strict  synchronization  requirements  of  the  standard  PRAM  have  been  used 
to  show  that  efficient  algorithms  can  be  efficiently  executed  on  asynchronous  models 
[29,  31,  45,  78,  75]. 

A simple randomized algorithm that serves as a basis for simulating arbitrary PRAM
algorithms on an asynchronous PRAM is presented by Martel et al. in [75]. This
randomized asynchronous simulation has very good expected performance for the Write-All
problem when the adversary is off-line. Other algorithms in this model are given by
Martel et al. in [72] and Martel and Subramonian in [73]. Kedem et al. [61] further
refined the results in [59] to produce an approach that leads to constant expected
slowdown of PRAM algorithms when the power of the adversary is restricted. The fail-stop
deterministic lower and upper bounds of [55] were also improved in [61] by $\log\log N$
factors. Recently, Kedem et al. [60] further investigated the use of randomization for
resilient parallel computation. Martel [71] has improved the analysis of the main
algorithm in [55] by a $\log\log N$ factor. This improvement also leads to the upper bound
that matches the lower bound in [55] under the memory snapshot assumption as we
show in this work.

A  parallel  algorithm  animation  tool  was  developed  by  Apgar  [9]  to  aid  in  the  analysis 
of  Write-All  algorithms  using  Stasko’s  [95]  TANGO  animation  system. 

Our modeling of fault tolerance where a processor is an entity subject to failures
has some similarities with the design of “robust” sorting networks using fault-prone
switches, as those of Rudolph in [88], and in general with the design of reliable systems
from unreliable components, as done by Pippenger in [82] using gates or by Dwork et al.
in [38] for networks. The notion of robustness that we target in this research differs from
that of the sorting network in [88], and in that network a linear number of operations
is still critical. Another example is the emulation of PRAMs on faulty hypercubes. See
the recent result of Aumann and Ben-Or on high probability emulation [12].

Interesting impossibility results for asynchronous shared memory models are given
by Herlihy in [47, 48]. General synchronous PRAM simulations are impossible using
bounded resources on asynchronous PRAMs. Buss et al. [27] show that some
deterministic computations can be performed using subquadratic work, even when arbitrary
asynchrony of PRAM processors is allowed. Anderson and Woll [8] also showed an
efficient randomized solution for Write-All, as well as the existence of Write-All solutions
with work $O(N^{1+\varepsilon})$ for P = N and any $\varepsilon > 0$ that can be used with the models we
define here.

Finally, our work here deals with dynamic patterns of faults; recent advances
in coping with static fault patterns are addressed, for example, by Kaklamanis in [54].
The granularity of faults in our work is at the processor level; for recent work on gate
granularities see [11, 82, 88].

1.3.2  Fault-tolerant  distributed  computation 

Adding fault tolerance to algorithms is the subject of significant current research in
the qualitatively different setting of dynamic asynchronous network protocols (recent
results and an overview of this area are well represented by [3, 4, 13, 15]).

The general problems encountered in fault-tolerant parallel computation and in
particular the problems of allocating active processors to tasks have similarities to the
problems  of  resource  management  in  a  distributed  setting.  Distributed  controllers  have 
been  developed  for  resource  allocation  in  network  protocols,  where  the  total  number  of 
messages  sent  is  the  resource  controlled.  For  instance,  the  algorithms  of  Lynch  et  al.  [70] 
(with  a  probabilistic  setting)  and  of  Awerbuch  et  al.  [4]  (with  a  deterministic  setting) 
are  among  the  most  sophisticated  in  that  area.  The  problem  we  address  in  this  thesis 
is,  at  an  intuitive  level,  one  of  controlling  resource  allocation.  The  resource  controlled 
is  all  available  PRAM  processor  steps,  and  the  reason  we  are  forced  to  control  it,  is  the 
requirement to complete the computation in the presence of faults. Note that unreliable
PRAM processor steps must control all available PRAM processor steps. This introduces
difficulties that recall the presence of network changes in [3, 13, 15], i.e., dynamic
changes  of  the  computation  medium.  Fault  tolerance  of  particular  network  architectures 
is  also  studied  in  [38].  However,  the  distributed  computation  models,  the  algorithms, 
and  their  analysis  are  quite  different  from  the  parallel  setting  studied  here. 

It is interesting that the concept of a “communication complexity controller” first
developed for distributed computing has an analog in parallel computing, i.e., “an
algorithmic transformation that guarantees robustness”. Note that the parallel setting is
simpler  to  define  and  has  easier  to  describe  solutions,  immediately  applicable  to  a  large 
body  of  existing  work  on  parallel  algorithms. 

Parallel computation in the setting where the shared memory is initially contaminated
has some similarities with the notion of a self-stabilizing system introduced by
Dijkstra  in  [34].  Paraphrasing  [34],  a  system  is  self-stabilizing  if  and  only  if,  regardless 
of  the  initial  state  the  system  can  always  make  a  state  transition  into  another  state, 
and  the  system  is  guaranteed  to  find  itself  in  a  legitimate  state  after  a  finite  number 
of  transitions.  Our  computations  using  initially  contaminated  memory  can  be  viewed 
as  self-stabilizing  with  respect  to  the  state  of  shared  memory.  In  order  to  describe  our 
technical  contributions  we  must  now  review  the  state-of-the-art  of  the  algorithmics  of 
Write-All. For the most recent results in the area of distributed self-stabilizing systems
see  the  works  of  Awerbuch  et  al.  [14,  16]. 

Finally,  the  synchronous  parallel  setting  with  fail-stop  processor  errors  is  free  from 
the  limitations  inherent  in  the  asynchronous  environment,  or  the  situations  where  the 
processors  can  perform  malicious  actions  (see  [41,  69,  79]  for  surveys  of  the  topic,  and 
[37,  42,  43]  for  lower  bounds  results). 

1.3.3  Technology  for  fault  tolerance 

Several engineering and technological approaches exist to implementing parallel systems
that enable them to operate correctly when they are subjected to certain failures.
Although these research and engineering areas are not as directly relevant to our research
as the work cited earlier, they are nevertheless extremely important. The methods
and technologies summarized below are instrumental in providing the basic hardware
fault tolerance, thus providing a foundation on which the algorithmic and software fault
tolerance can be built.


Fault-tolerant  memories 

Semiconductor memories are the essential components of processors and of shared memory
parallel systems. These memories are routinely manufactured with built-in
fault  tolerance.  The  three  main  techniques  used  in  providing  memory  fault-tolerance 
are:  (1)  Coding:  in  addition  to  the  bits  being  stored,  this  technique  utilizes  additional 
(parity)  bits  in  conjunction  with  various  error  detecting  and/or  correcting  codes  (see 
McEliece  [76]  for  a  grand  tour).  (2)  Replication  or  shadowing:  two  or  more  copies  of 
the  memory  are  maintained  with  either  the  majority  vote  being  taken,  or  the  faulty 
units  being  shut  off  in  a  hybrid  approach.  (3)  Reconfiguration:  spare  memories  are 
used  to  replace  faulty  units  by  reconfiguring  memory  units.  The  survey  of  Sarrazin  and 
Malek  [89]  covers  these  techniques  that  are  used  to  make  memory  (cache  and  main) 
more  reliable  without  appreciably  degrading  its  performance. 
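
The replication technique (2) can be illustrated with a small software sketch. This is
an illustration of majority voting only, not a description of any hardware scheme
surveyed in [89]; the function name and the representation of each memory copy as a
Python list are assumptions made for this example.

```python
from collections import Counter


def majority_read(replicas, address):
    """Read one word from replicated memories and return the majority value.

    replicas -- list of lists, each a hypothetical copy of the same memory
    address  -- index of the word to read
    Raises ValueError if no value is held by a strict majority of the copies.
    """
    values = [copy[address] for copy in replicas]
    value, count = Counter(values).most_common(1)[0]
    if count * 2 <= len(values):
        raise ValueError("no majority among replicas")
    return value
```

With three copies, a single corrupted word is outvoted by the two intact copies, which
is the sense in which replication masks a fault without reconfiguring the memory.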


Robust  interconnection  networks 

Another  important  subject  that  has  been  the  target  of  work  is  the  area  of  fault-tolerant 
interconnection networks. Interconnection networks are typically used in multiprocessor
systems to provide communication among processors, memory modules and other
devices [52]. An encyclopaedic survey of the interconnection networks is given by
Almasi and Gottlieb in [6, Chapter 8]. Theoretical foundations for such networks are
summarized  by  Pippenger  in  [83].  The  networks  are  made  more  reliable  by  employing 
redundancy.  A  survey  of  fault  tolerant  interconnection  networks  is  presented  by  Adams 
et  al.  in  [2].  An  interesting  interconnection  network  routing  strategy  was  described  by 
Preparata  [85],  in  which  fast  routing  is  achieved  by  allowing  for  some  messages  to  be 
lost  and  using  a  redundancy  scheme  [84,  86]  to  reconstruct  lost  information. 

The  area  of  fault  tolerance  and  efficiency  of  interconnection  networks  is  extremely 
important  as  an  enabling  technology  for  fault-tolerant  parallel  computation.  In  this 
thesis,  we  limit  our  work  to  the  design  of  algorithmic  techniques  that  assume  that  a 
robust  interconnection  medium  such  as  those  surveyed  in  [2]  is  available. 


Fault tolerance in special purpose parallel computers

Systolic  arrays  are  special  purpose  parallel  computers  that  lend  themselves  to  being 
engineered  with  fault-tolerant  features.  Algorithm-based  fault-tolerant  techniques  are 
designed  for  specific  algorithms  that  are  implemented  as  systolic  arrays,  and  where 
occasional and intermittent failures are expected. For example, such techniques are used
for various calculations on or with matrices. Typically a limited number of faults can
be  handled  by  systems  that  utilize  various  checksumming  methods  to  locate  faults  that 
caused  incorrect  values  to  be  computed,  and  then  reconstruct  the  correct  values. 

Reconfigurable VLSI-based arrays are used when permanent faults (i.e., due to
fabrication defects) are the primary concern. The arrays are manufactured with spare
modules, such that a number of failures can be tolerated by detecting faulty modules
and either bypassing them or automatically replacing them with the spare modules. The
survey of Abraham et al. [1] overviews the algorithm-based fault tolerance in systolic
arrays  and  reconfigurable  VLSI-based  systolic  arrays.  Relevant  theoretical  bounds  are 
given  by  Kaklamanis  et  al.  in  [54]. 

In his thesis, Hughey [50] presents a programmable systolic array, and he also
describes several techniques used for on-board fault detection along with software
techniques that enable the bypassing of certain processing element failures.

In  the  next  section  we  apply  some  of  the  technologies  cited  above  to  show  an  example 
of  a  realizable  system  that  is  consistent  with  the  models  we  study  in  this  work. 


1.4  Relation  to  Physical  Systems 

The  abstract  models  of  parallel  computation  we  present  and  study  must  be  able  to 
reflect  or  capture  the  characteristics  of  actual  systems. 

Processor delay is a feature of any multi-programming environment, in which processing
priorities are not centrally or predictably specified. A processor may be temporarily
or  permanently  suspended  due  to  an  external  event,  and  processing  resources  may  be 
unexpectedly  required  by  another  task  as  determined  by  the  underlying  system.  In 
the  synchronous  parallel  environment  that  we  study,  a  processor  delay  is  treated  as  a 
processor  failure  subject  to  a  possible  subsequent  restart. 

Processor failure may occur either because of a physical fault or because another
entity in the system preempts processing time without saving the old state. If a processor
discontinues its operations due to an internal or external event for the duration of
a computation, we must ensure that the computation in progress will proceed to a
successful completion.

Communication delay is a well-known aspect of multi-component systems when information
available at certain components is needed by other components. Small communication
delays can be consistent with a system that is designed to operate synchronously.
On the other hand, unpredictable or non-uniform delays introduce additional complexities
to the design of algorithms. In the synchronous parallel setting we assume that the
communication delay is uniform for all processors. This will allow for the complexity
measures to be meaningfully applied when the communications delay is a function of
the size of the task and the number of processing elements.

Communication failure may be due to memory failures or may result from memory
operations by other processors. If the communication network reports the failure of an
operation,  the  processor  can  re-attempt  the  access,  and  the  situation  can  be  modeled 
as  a  communication  delay.  If  unannounced  failures  can  occur,  an  algorithm  must  either 
explicitly  check  its  write  operations  or  ensure  in  some  other  way  that  omission  of  a  write 
is  not  detrimental  to  performance. 

In  this  work,  we  treat  delay  and  failure  as  occurring  to  the  processors  only.  If  memory 
operations  are  atomic  and  synchronous,  they  may  be  assumed  to  be  instantaneous, 
and  the  communication  delays  or  failures  may  be  attributable  to  the  processor,  and 
accountable  at  the  processor  level  of  abstraction. 

An  architecture  for  a  restartable  fail-stop  multiprocessor 

The  main  goal  of  this  work  is  to  study  algorithmic  techniques  that  enable  efficient 
parallel  computation  on  realizable  multiprocessor  systems.  We  now  suggest  one  way  of 
realizing  the  abstract  model  of  computation  where  processors  are  subject  to  fail-stop 
errors  and  restarts,  i.e.,  the  model  we  formalize  in  Sections  2.5  and  2.6. 

Engineering and technological approaches exist that allow implementing electronic
components and systems that operate correctly when subjected to certain failures, as was
overviewed in the previous section (for surveys and examples, see [33, 51, 53]). We will
now cite the particular technologies that are instrumental in providing basic hardware
fault tolerance and employ these technologies in a foundation on which the algorithmic
and software fault tolerance can be built.

Figure 1.1: A robust fail-stop multiprocessor.

Semiconductor memories are the essential components of shared memory parallel
systems. Memories are routinely manufactured with built-in fault tolerance using replication
and coding techniques without appreciably degrading performance [89]. Interconnection
networks are typically used in a multiprocessor system to provide communication
among processors, memory modules and other devices, e.g., as in the Ultracomputer [91].
The fault tolerance of interconnection networks has been the subject of much work in its
own right. The networks are made more reliable by employing redundancy [2]. A combining
interconnection network that is perfectly suited for implementing synchronous
concurrent reads and writes is formally treated in [62] (the combining properties are
used in their simplest form only to implement concurrent access to memory). Finally,
fail-stop processors are formally presented and justified in [90].

The  abstract  model  that  we  are  studying  can  be  realized  (Figure  1.1)  in  the  following 
architecture,  using  the  components  just  cited: 

1.  There  are  P  fail-stop  processors,  each  with  a  unique  address  and  some  amount  of 
local  memory.  Processors  are  unreliable. 

2. There are Q addressable shared memory cells. The input of size $N \le Q$ is stored
in shared memory. This memory is assumed to be reliable.

3.  Interconnection  of  processors  and  memory  is  provided  by  a  synchronous  combining 
interconnection  network.  This  network  is  assumed  to  be  reliable. 


With this architecture, our algorithmic techniques become completely applicable;
i.e.,  the  algorithms  and  simulations  we  develop  will  work  correctly,  and  within  the 
complexity  bounds  (under  the  unit  cost  memory  access  assumption)  for  all  patterns 
of  processor  failures  and  restarts  when  the  underlying  components  are  subject  to  the 
failures  within  their  respective  design  parameters. 

If there is a cost L associated with reading or writing a single shared memory cell,
then the work complexity of the algorithms and the simulations that we studied increases
by a factor of L.

1.5  Structure  of  the  Document 

The  rest  of  this  thesis  is  structured  as  follows.  Chapter  2  defines  and  motivates  the 
models  employed  by  this  research,  the  associated  measures  of  complexity,  the  models 
of  failure,  and  the  key  Write-All  problem.  Chapter  3  contains  the  definitions  and  the 
analysis  of  fault-tolerant  parallel  algorithms  using  three  processor  allocation  paradigms: 
global  allocation,  local  allocation  and  hashed  allocation.  In  Chapter  4  we  address  the 
lower  bounds  under  the  memory  snapshot  assumption.  In  Chapter  5,  the  building 
blocks  of  the  previous  chapters  are  used  to  implement  a  general  simulation  of  parallel 
algorithms. There we also discuss improvements to the oblivious simulations. In
Chapter 6 we solve the Write-All problem when the shared memory is contaminated
and we eliminate the requirement of atomic writes of a logarithmic number of bits. We
conclude  with  a  discussion  in  Chapter  7. 

The  bibliography  is  followed  by  three  appendices.  Appendix  A  contains  the  detailed 
pseudocode  for  algorithm  W  and  two  lemmas.  Appendix  B  contains  the  pseudocode 
for algorithm X. Appendix C is reserved for mathematical lemmas used in the lower
bounds  proofs. 

Chapter 2

Models  and  Definitions 


Modeling  parallel  computation  and  processor  failures  must  go  hand  in  hand  with 
the  study  of  algorithms  and  their  complexity.  In  this  chapter  we  define  the  base 
models  of  the  computation  that  are  the  subject  of  our  research,  the  models  of  failure 
that  we  are  studying,  the  two  major  variations  of  the  fail-stop  parallel  random  access 
machine  and  the  rationale  behind  the  technical  decisions  that  we  made.  We  discuss  the 
definitions  of  the  complexity  measures  that  characterize  the  efficiency  of  algorithms  for 
the  models  selected  and  in  the  context  of  particular  failure  models.  We  also  formalize 
the key Write-All problem.


2.1  Base  Model  of  Computation 

We  study  fault-tolerant  algorithms  for  the  closely  coupled  synchronous  shared  memory 
multiprocessor  systems  where  the  processors  need  to  cooperate  in  working  towards  a 
common  computational  goal.  Specifically,  we  study  algorithms  for  the  systems  that  can 
be  modeled  by  the  Parallel  Random  Access  Machine  (PRAM)  of  Fortune  and  Wyllie  [44]. 

The  PRAM  model  is  used  widely  in  the  parallel  algorithms  research  community 
as  a  convenient  and  elegant  model,  and  a  wealth  of  efficient  algorithms  exist  and  are 
continually  being  developed  for  this  model.  The  surveys  of  Eppstein  and  Galil  [40] 
and  Karp  and  Ramachandran  [58]  cover  all  of  the  important  variations  of  the  PRAM 
model,  and  give  most  of  the  fundamental  PRAM  algorithms.  Instead  of  reiterating  the 
rationale  for  studying  the  PRAM  model  and  listing  the  variations  of  the  PRAM  models, 
we refer the reader to the two excellent surveys [40, 58] that succinctly address the topic
in  the  respective  introductory  sections.  For  the  base  model  in  this  work,  we  use  the 
following  definition  of  the  PRAM  [44]: 

1.  There  are  P  initial  processors  with  unique  identifiers  (PID)  in  the  range  1, . . . ,  P. 
Each  processor  has  access  to  its  PID,  and  the  number  of  processors  P. 

2. The global memory accessible to all processors is denoted as shared; each processor
also has a constant size local memory denoted as private. All memory cells
are capable of storing $O(\log \max\{N, P\})$ bits on inputs of size N.

3.  The  input  is  stored  in  N  cells  in  shared  memory,  and  the  rest  of  the  shared  memory 
is  cleared  (i.e.,  contains  zeroes).  The  processors  have  access  to  the  input  size  N. 

We  use  the  concurrent  read,  concurrent  write  (CRCW)  variation  of  the  PRAM,  in 
which  multiple  processors  can  concurrently  read  or  write  to  the  same  memory  location. 
In  the  algorithms  we  present,  all  concurrently  writing  processors  write  the  same  value 
making  our  approach  independent  of  the  CRCW  conventions  for  writes. 

Our  algorithms  are  described  in  a  model  independent  fashion  using  a  consistent  high 
level  notation  with  the  obvious  forall/parbegin/parend  parallel  construct.  Such  high 
level  notation  can  be  formalized  as  a  programming  language  that  can  be  compiled  using 
standard compilation techniques and the techniques specific to PRAMs as discussed by
Wyllie  [102]. 

2.2  Measures  of  Efficiency 

Computation  speed-up  is  one  of  the  central  reasons  for  using  parallel  computers.  In  this 
section,  we  introduce  and  discuss  a  particular  way  of  flexibly  factoring  fault  tolerance 
into  the  conventional  definition  of  parallel  work.  This  definition  can  be  adapted  for  the 
particular  failure  models  that  we  examine  in  later  sections. 

If  a  task  can  be  done  in  time  T  using  a  single  processor,  we  would  like  to  perform  the 
same task in parallel time $\tau = T/P$ using P processors. This optimal linear speed-up
is  one  of  the  important  goals  of  parallel  algorithm  design. 

More formally, let parallel work be the product of the number of processors P and
the parallel time $\tau$. Parallel algorithms are considered optimal when the parallel work
is within a multiplicative constant of the best known sequential time (of course, for the
computation to be practical, the constant must be small). Even if not optimal, parallel
algorithms are generally considered efficient if they attain a near linear speed-up. That
is, using P processors on inputs of size N, the parallel time achieved is $T(N)/P$ (within
a multiplicative factor polylogarithmic in N), where $T(N)$ is the best known sequential
bound and P ranges over $1, \ldots, N$.
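
As a standard illustration of these definitions (a textbook example, not one taken from this thesis), consider summing N integers, for which the best sequential time is $T(N) = \Theta(N)$. A balanced-tree summation runs in $O(\log N)$ parallel time, and the two processor counts below show the difference between an efficient and an optimal algorithm:

```latex
\begin{align*}
P = N,\quad \tau = O(\log N)
  &\;\Longrightarrow\; P \cdot \tau = O(N \log N)
  && \text{(efficient: within a } \log N \text{ factor of } T(N)\text{)} \\
P = \frac{N}{\log N},\quad \tau = O(\log N)
  &\;\Longrightarrow\; P \cdot \tau = O(N)
  && \text{(optimal: within a constant factor of } T(N)\text{)}
\end{align*}
```

In the second case each processor first sums a block of $\log N$ inputs sequentially, and the tree summation is then applied to the $N/\log N$ partial sums.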

Efficient  or  optimal  parallel  algorithms  have  been  developed  for  many  fundamental 
computation tasks such as manipulating integers (e.g., add or sort N integers),
manipulating lists and trees (e.g., compute the rankings of the elements of an N-size list,
and compute a preorder numbering or subtree sizes, ..., of an N-size tree). These
algorithms play an important role in realizing the promise of high speed-ups using massive
parallelism.

Unfortunately,  the  quest  for  high  speed-ups  has  led  to  efficient  parallel  algorithms 
that are very tightly designed, so that every processor is fully utilized doing something
essential for resolving the input task. Thus, parallel algorithm efficiency implies
a  minimization  of  redundancy  in  the  computation  that  leaves  very  little  room  for  fault 
tolerance.  It  is  interesting  to  note  that  most  of  the  known  efficient  parallel  algorithms 
do  not  terminate  correctly  or  become  quite  inefficient  if  they  are  perturbed  by  simple 
processor  errors.  These  perturbations  are  of  course  outside  the  original  setting,  but  are 
nonetheless  realistic. 

Once processor failures are introduced into a parallel computation, the $\tau \times P$ measure
is no longer as relevant. This is because the computational resource is no longer under
the control of the computation: it varies due to failures, and only limited resources may
be  available  at  any  given  time.  The  efficiency  of  fault-tolerant  parallel  computation  is 
more  appropriately  measured  in  terms  of  the  processor  work  steps  that  are  available  to 
the  computation. 

Consider a computation with P initial processors that terminates in parallel-time $\tau$
after completing its task on some input data I of size N and in the presence of fail-stop
error pattern F. If $P_i(I, F) \le P$ is the number of processors completing an instruction
at step i, then we define the following measure: $S = S(I, F, P) = \sum_{i=1}^{\tau} P_i(I, F)$.

Figure 2.1: Work in the absence (W) and presence (S) of processor failures.


Example 2.1 Work subject to failures: Consider a $P_0$-processor algorithm that terminates
in time $\tau_0$ in the absence of failures. The fault-free work $W = \tau_0 \cdot P_0$ is the area of
the dashed rectangle in Figure 2.1. Due to failures the algorithm begins with $P_1 \le P_0$
processors, and then the number of active processors is reduced to $P_2$, $P_3$ and $P_4$ at
times $t_1$, $t_2$, $t_3$ respectively until the computation terminates at time $\tau_1$. The work in
the presence of these failures is more appropriately described by the area bounded by
the “staircase” solid lines and the two axes. This area is S. □
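
The staircase area can be computed directly from the per-step processor counts. The Python below is a minimal sketch for illustration only; the function names and the representation of a run as a list of counts $P_i$ are assumptions, not notation from the text.

```python
from typing import Sequence


def work_with_failures(active_per_step: Sequence[int], initial_processors: int) -> int:
    """Return S, the sum over all steps i of P_i (processors completing step i)."""
    for p_i in active_per_step:
        # Every P_i is bounded by the number of initial processors P.
        if not 0 <= p_i <= initial_processors:
            raise ValueError("each P_i must lie between 0 and P")
    return sum(active_per_step)


def fault_free_work(initial_processors: int, steps: int) -> int:
    """Return W = tau_0 * P_0, the area of the full rectangle."""
    return initial_processors * steps
```

For instance, a hypothetical run with 8 initial processors and counts `[6, 6, 4, 4, 4, 2, 1]` yields S = 27, whereas a fault-free run of 4 steps on the same machine has W = 32.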

S  is  used  as  the  basis  for  measuring  efficiency  of  fault-tolerant  algorithms.  The 
algorithms  are  studied  in  the  context  of  the  chosen  failure  models  and  in  conjunction 
with natural measures of efficiency. We use S to define available processor steps (S)
and the notion of robustness for the fail-stop no-restart model in Section 2.5, and we
use S to define completed work (S') and overhead ratio ($\sigma$) for the restartable fail-stop
model  in  Section  2.6. 

The use of generalized parallel work as the primary complexity measure has the
additional benefit of allowing a meaningful comparison of the efficiency of fault-tolerant
and non-fault-tolerant algorithms. For example, the well-known class NC of algorithms
characterizes  efficiency  primarily  in  terms  of  (polylogarithmic)  time  efficiency,  even  if 
the  computational  agent  is  large  (polynomial)  relative  to  the  size  of  a  problem  [30,  81]. 
To  characterize  better  the  efficiency  of  parallel  algorithms,  the  efficiency  measures  need 
to  take  into  account  both  the  parallel  time  and  the  size  of  the  computational  resource, 
i.e.,  parallel  work.  Such  characterizations  of  parallel  algorithm  efficiency  are  defined  by 
Vitter  and  Simons  in  [100]  and  expanded  on  by  Kruskal  et  al.  in  [63]. 

2.3 The Write-All Problem

In  order  to  deal  with  failures,  it  is  necessary  for  the  correct  processors  to  detect  the 
failures  and  reschedule  the  work  of  the  failed  processors.  The  main  problem  here  is 
that  the  minimization  of  redundancy  in  the  computation  does  not  leave  many  resources 
for  failure  detection  and  load  rescheduling.  It  is  fairly  easy  to  see  that  naive  processor 
failure  detection  and  reassignment  strategies,  e.g.,  use  of  a  master  control  or  clustering  of 
processors,  are  inadequate.  A  master  control  strategy  is  sensitive  to  particular  patterns 
of simple failures. Clustering can degrade the performance, measured as the worst case
work S, by a linear or greater multiplicative factor. Let us illustrate this discussion with
an  example. 

Example  2.2  Write-All:  One  of  the  simplest  tasks  performed  by  a  PRAM  is:  given  a 
zero-valued  array  of  N  elements  and  P  processors,  write  value  1  into  each  array  location. 
We call this task Write-All. When P = N, this Write-All problem is trivially solved
in constant time by the following (PRAM) program. However, even a single processor
failure will prevent the establishment of {x[i] = 1 (i = 1, ..., N)} as the postcondition.

forall processors PID = 1..N parbegin
    shared integer array x[1..N];
    x[PID] := 1
parend

Simple fixes are available that will make the above program more fault-tolerant. For
example, for a small number of failures $\le k$, consider the clustering algorithm below.
This algorithm performs well for dynamic failure patterns with few errors, but poorly if
there are many failures. For example, if k is fixed and N variable then it cannot handle
N/2 failures, and if k is allowed to grow to N/2 then S becomes quadratic in N.

forall processors PID = 1..N parbegin
    shared integer array x[1..N];
    for i = PID to PID + k do
        if i <= N then x[i] := 1 else x[i - N] := 1 fi
    od
parend

□ 
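
The behaviour of the clustering program under failures can be explored with a small sequential simulation. The Python below is only a sketch under stated assumptions: the function name `clustered_write_all`, the `fail_at` dictionary, and the convention that a failed processor executes no step from its failure time onward are all introduced here for illustration. It assumes P = N and k < N.

```python
from typing import Dict, Tuple


def clustered_write_all(n: int, k: int, fail_at: Dict[int, int]) -> Tuple[bool, int]:
    """Simulate the clustering program with N = P = n under fail-stop failures.

    fail_at maps a PID (1..n) to the first step (0-based) it no longer executes.
    Returns (whether every cell was set to 1, S = completed processor steps).
    """
    x = [0] * n                       # shared array x[1..N], stored 0-based
    completed_steps = 0
    for step in range(k + 1):         # i runs over PID .. PID + k
        for pid in range(1, n + 1):
            if step >= fail_at.get(pid, k + 1):
                continue              # this processor has already stopped
            i = pid + step
            cell = i if i <= n else i - n   # wrap around, as in the pseudocode
            x[cell - 1] = 1
            completed_steps += 1
    return all(v == 1 for v in x), completed_steps
```

With n = 8 and k = 2, two initial failures `{3: 0, 4: 0}` still leave every cell written, while three adjacent initial failures `{2: 0, 3: 0, 4: 0}` leave cell 4 unset. Setting k = 0 reproduces the trivial program, which a single failure defeats. Without failures the simulation reports S = N(k + 1), which is quadratic in N once k grows to N/2.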

We will show the existence of an N-processor CRCW PRAM algorithm for the
Write-All problem of Example 2.2, for which $S = O(N \log^2 N)$ for any pattern of
fail-stop processor failures. This solution also illustrates the notion of robustness. The
original Parallel-time × Processors product, N, is increased by at most a multiplicative
factor polylogarithmic in the size of the input, for any dynamic pattern of failures with
at least one surviving processor. Note that there is no knowledge of how many, when,
or which processors will fail.

Our techniques for deriving fault-tolerant and efficient algorithms are based on robust
solutions for the Write-All problem, and so we grace it with its own name:

Definition 2.1 Given a zero-valued array of N elements and P fail-stop PRAM processors,
the Write-All problem is to set each element of the array to 1. □

Remark 2.1 The main point of the Write-All problem is not initializing an array of N
elements,  but  transforming  the  contents  of  shared  memory  from  one  state  to  another. 
Write-All  is  formulated  to  capture  the  computational  progress  that  can  be  naturally 
accomplished  in  unit  time  by  a  PRAM  in  the  absence  of  failures.  Write-All  can  be 
equally  defined  in  terms  of  computing  the  absolute  values  of  elements  of  an  array,  or 
copying  to  the  target  array  values  from  another  array. 


2.4  Models  of  Failure 

We  now  present  several  dimensions  of  failure  modeling  and  the  models  we  are  concerned 
with  in  this  work.  Significant  research  in  the  past  decade  was  devoted  to  the  study  of 
fault-tolerant,  distributed  computation  in  the  presence  of  arbitrary  (and  even  malicious) 
system  faults.  Byzantine  and  other  simpler  failures  were  extensively  studied  in  the 
context  of  distributed  algorithms  for  the  consensus  problem  (e.g.,  Pease  et  al.  [79,  80], 
also  see  a  survey  by  Fischer  [41]).  The  failure  models  for  parallel  computation  are 
constructed  from  failure  definitions  along  the  spectrum  of  failure  models  discussed  below. 

Types of failures, such as Byzantine failures, omission failures, and fail-stop failures, have
been  part  of  the  consensus  literature  as  distributed  algorithms  were  being  developed  to 
deal  with  failures  (e.g.,  as  discussed  by  Lamport  and  Lynch  in  the  survey  [66]).  In  the 
area of parallel computation, where processors are more tightly coupled as compared
to a distributed environment, failures need to be classified further with the emphasis
placed  on  the  more  benign  fail-stop  case.  For  example,  the  synchronous  parallel  setting 
with  fail-stop  processor  errors  is  free  from  the  limitations  inherent  in  the  asynchronous 
environment,  or  the  situations  where  the  processors  can  perform  malicious  actions  (see 
[79,  41,  69]  for  surveys,  and  [42,  43,  37]  for  lower  bounds). 

Using state of the art technology, processing elements are being designed with built-in
diagnostics capabilities. Upon detecting failures, such processors can isolate themselves
from the rest of the computing environment without harmful effects. Such processors
are modeled as fail-stop processors. It was shown by Schlichting and Schneider in
[90] that using a formal methodology and an appropriate programming language framework,
it is possible to construct correct algorithms for fail-stop processors. In this work
we  in  turn  show  that  it  is  possible  to  construct  efficient  and  fault-tolerant  algorithms  for 
certain  classes  of  parallel  fail-stop  processors.  In  the  context  of  closely  coupled  parallel 
computation  the  fail-stop  failures  are  both  accurate  and  tractable. 

Adversaries: The notion of adversary is useful and important for the study of parallel
computations under different models of failure. An adversary determines which
processors  can  fail  at  what  step  of  the  computation  and  which  errors  can  be  caused 
by  the  failures.  When  the  adversary  is  omniscient,  it  has  complete  knowledge  of  the 
computation.  The  adversary  can  be  restricted  to  have  time  or  space  limited  knowledge 
of  the  actions  (e.g.,  those  performed  in  the  past  or  just  in  a  subset  of  the  processors). 
When  the  adversary  is  in  addition  on-line,  it  can  decide  during  the  computation  what 
processors  will  fail,  as  in  this  work.  Alternatively,  the  adversary  can  be  off-line  as  in 
Martel  et  al.  [75],  in  which  case  all  failure  decisions  are  made  prior  to  the  start  of  a 
computation. Finally, an adversary might be limited probabilistically as in Kedem et
al. [59], where faults occur with a certain probability. Here we deal with
omniscient on-line adversaries.

Granularity: Whereas there is a fair amount of analysis based on types of failures and
kinds  of  adversaries,  there  has  been  less  attention  paid  to  granularity  of  failures.  For 
parallel  systems  granularity  seems  to  be  a  key  concept.  Failure  granularity  defines  the 
extent  to  which  sub-system  failures  affect  the  overall  system.  Granularity  also  defines 
the  smallest  system  components,  such  that  a  failure  within  the  component  is  either 
completely  masked  by  the  component  or  causes  the  failure  of  the  entire  component.  In 
practice,  many  parallel  programs  are  implemented  using  threads  packages  [22,  36].  It  is 
also reasonable to study failure granularity at the level of a single thread.

Our modeling of fault tolerance has some similarities with the design of “robust”
sorting networks, such as those of Rudolph [88], and in general with the design of reliable
systems from unreliable components, as in Pippenger [82] or Dwork et al. [38]. The
distinguishing characteristic of our approach is the investigation of fault tolerance at the
processor granularity as opposed to gate or switch granularities ([82] and [88], respectively).

Magnitude:  Many  hardware  oriented  fault  tolerance  techniques  provide  fault  masking 
up  to  a  pre-determined  limit.  In  a  distributed  setting,  some  algorithms  can  handle 
processor  failures  when  the  number  of  failures  does  not  exceed  a  certain  fraction  of 
the total number of processors. It is important to develop techniques that can deal
with any number of failures; however, such techniques should also yield good results
when the number of failures is relatively small. The efficiency of a fault-tolerant solution
may depend on the maximum allowable number of failures, but it is imperative that
computations remain correct for any number of failures (as long as one or more processors
remain operative). Here we study both arbitrary failures and arbitrary failures with restarts.

Recovery:  In  some  models  it  is  reasonable  to  assume  that  faulty  processors  never 
recover.  For  example,  manufacturing  defects  may  permanently  disable  some  of  the 
systolic array processors, while the array remains functional when equipped with on-board
fault-tolerance [1, 28, 50]. It is also reasonable for processors to recover at some
point  and  rejoin  a  computation  in  progress.  Failures  may  be  quantified  by  the  duration 
of  a  processor’s  absence  from  a  computation.  We  consider  both  the  no-restart  and 
restartable  models. 

Frequency:  A  final  but  significant  dimension  is  the  frequency  and  timing  pattern  of 
failures.  Assumptions  about  failure  frequency  must  underlie  any  probabilistic  analysis. 
In addition, we believe that fault-tolerant algorithms must show graceful degradation
of performance, so that when failures are infrequent the algorithms are near the
peak of their efficiency. In this work we place no restriction on the frequency of failures.

In  the  next  sections  we  define  two  variations  of  the  PRAM  whose  processors  are 
subject  to  stop  failures.  The  two  models  are: 

1.  The  fail-stop  PRAM,  where  the  processors  do  not  restart  after  a  failure,  and 

2.  The  restartable  fail-stop  PRAM,  where  the  processors  can  restart  after  a  failure. 


We study fail-stop processor errors [90] that are determined by the worst case omniscient
on-line (adaptive) adversary that is not limited as far as the frequency and
magnitude of errors are concerned.

In each of the models, the patterns of processor failures will be specified as sets of
triples <tag, PID, t>, where tag is a label indicating the type of an event (i.e., failure or
restart), PID is the processor identifier, and t is the time instance indicating when the
event occurs. The size of the failure pattern F is defined as the cardinality |F|.


2.5  No-restart  Fail-stop  CRCW  PRAM 

We  begin  with  the  PRAM  model  given  in  Section  2.1.  This  model  is  extended  with 
a  failure  model,  and  a  complexity  measure  that  captures  the  work  of  a  fault-tolerant 
algorithm  when  its  processors  are  subject  to  failures. 

2.5.1  Failure  model 

The  fail-stop  CRCW  PRAM  extends  the  basic  model  by  allowing  processor  failures. 
The  failure  model  for  the  fail-stop  CRCW  PRAM  is  defined  as  follows: 

1. We allow any dynamic pattern F of processor fail-stop errors provided one processor
survives (one processor is necessary if anything is to be done). F describes
which  processors  fail  and  when.  This  pattern  is  determined  by  an  adversary,  who 
knows  everything  about  the  structure  and  the  dynamic  behavior  of  the  algorithm. 

2.  We  only  consider  fail-stop  (no  restart)  behavior:  processors  fail  by  stopping  and 
not  performing  any  further  actions.  Fail-stop  models  are  reasonable  approxima¬ 
tions  of  what  is  desirable  and  achievable  in  practice  [90]. 

3. We assume that the shared memory writes of the individual PRAM steps are
atomic with respect to failures: failures can occur before or after a shared write of
$O(\log \max\{N, P\})$-bit words, but not during the write. This non-trivial assumption
is made only for simplicity of presentation. Algorithms using this assumption
can be automatically converted to use only single bit atomic writes as we show in
Section 6.2.

The failure patterns are syntactically defined as follows:

Definition 2.2 A fail-stop no-restart failure pattern F is a set of triples <tag, PID, t>
where tag is failure indicating processor failure, PID is the processor identifier, and t is
the time indicating when the processor stops. □

Remark 2.2 Since at least one processor must survive if a computation is to terminate,
we need only consider failure patterns F of size |F| < P for computations with P initial
processors.

2.5.2  Measure  of  efficiency:  available  processor  steps 

We define the complexity measure of available processor steps using the measure
introduced in Section 2.2. This measure generalizes the standard Parallel-time
× Processors product. As we discussed earlier, this measure appropriately evaluates the
efficiency of fault-tolerant parallel algorithms. We formally define S as follows:

Definition 2.3 Consider a computation with $P$ initial processors that terminates in
parallel-time $\tau$ after completing its task on some input data $I$ of size $N$ and in the
presence of fail-stop error pattern $F$. If $P_i(I,F) \le P$ is the number of processors
completing an instruction at step $i$, then we define $S(I,F,P)$ as:

$$S(I,F,P) = \sum_{i=1}^{\tau} P_i(I,F).$$ □
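
The sum in Definition 2.3 can be evaluated directly for a concrete failure pattern. The following is only a minimal sketch: the function name `available_steps` is hypothetical, and it assumes (a convention not fixed by the text) that a processor failing at time t does not complete the instruction of step t.

```python
from typing import Dict, Iterable, Tuple

Event = Tuple[str, int, int]  # (tag, PID, t), as in the failure-pattern triples


def available_steps(P: int, tau: int, pattern: Iterable[Event]) -> int:
    """Evaluate S(I, F, P) = sum over i = 1..tau of P_i(I, F), no-restart model.

    Assumption: a processor with failure time t completes steps 1..t-1 only.
    """
    fail_time: Dict[int, int] = {}
    for tag, pid, t in pattern:
        if tag != "failure":
            raise ValueError("a no-restart pattern contains only failures")
        # Keep the earliest failure time if a PID is listed more than once.
        fail_time[pid] = min(t, fail_time.get(pid, t))

    total = 0
    for i in range(1, tau + 1):
        # P_i: processors that have not failed at or before step i.
        total += sum(1 for pid in range(1, P + 1)
                     if fail_time.get(pid, tau + 1) > i)
    return total
```

With no failures the value is the familiar product P × tau; each failure removes the failed processor's remaining steps from the sum.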

We now define available processor steps $S^*$ in terms of $S$:

Definition 2.4 A P-processor PRAM algorithm on any input data $I$ of size $|I| = N$
and in the presence of any pattern $F$ of failures of size $|F| \le M < N$ uses available
processor steps

$$S^* = S_{N,M,P} = \max_{I,F} \{ S(I,F,P) \}.$$ □

From the definition of S, we immediately observe the following property:

Property 2.5 Given any fault-tolerant parallel algorithm that uses up to $P$ processors,
on inputs of size $N$, if $M_1 < P_1 \le P_2 \le P$ and $M_2 < P_2 \le P$, then
$S_{N,M_1,P_1} \le S_{N,M_2,P_2}$.


This is so because by Definition 2.4, when the smaller number of processors $P_1$ is
used, we maximize over the failure patterns with $P_2 - P_1$ failures at time zero.

The measure $S^*$ is used in turn to define the notion of algorithm robustness that
combines fault tolerance and efficiency:

Definition 2.6 Let $T(N)$ be the best sequential (RAM) time bound known for N-size
instances of a problem. We say that a parallel algorithm for this problem is a robust
parallel algorithm if: for any input $I$ of size $N$ and for any number of initial processors $P$
($1 \le P \le N$) and for any failure pattern $F$ of size $M$ with at least one surviving processor
($M < N$), this algorithm completes its task and it has
$S^* = S_{N,M,P} \le c\,T(N)\log^{c'} N$, for fixed $c$, $c'$. □

Remark  2.3  Note  that  the  available  processor  steps  S*  defines  the  worst  case  efficiency 
of  a  computation.  In  some  cases  it  may  be  sufficient  for  a  computation  to  be  efficient 
with  high  probability,  and  allow  worst  case  inefficiency.  Such  an  approach  was  taken  in 
[59]  using  a  probabilistically  restricted  adversary.  In  an  analogous  fashion  it  is  possible 
to  modify  the  definition  of  “robustness”  based  on  the  failure  model  used. 

2.5.3  Discussion  of  the  technical  choices  made 

Fail-stop  errors  vs.  malicious  processor  behavior:  We  have  chosen  to  consider 
only  the  failure  models  where  the  processors  do  not  write  any  erroneous  or  maliciously 
incorrect values to shared memory. While malicious processor behavior is often
considered in conjunction with message based systems, it makes less sense to consider malicious
behavior  in  tightly  coupled  shared  memory  systems.  This  is  because  even  a  single  faulty 
processor  has  the  potential  of  invalidating  the  results  of  a  computation  in  unit  time, 
and  because  in  a  parallel  system  all  processors  are  normally  “trusted”  agents,  and  so 
the  issues  of  security  are  not  applicable. 

Concurrent writes vs. exclusive writes: The choice of CRCW (concurrent read,
concurrent write) is justified in the discussion of lower bounds in Chapter 4, where a
simple result shows that the CREW (concurrent read, exclusive write) model does not
admit fault-tolerant efficient algorithms.

Clear vs. contaminated initial memory: We require that a linear number of shared
memory locations be initially clear, i.e., initialized to zero. While this is consistent with
definitions of PRAM such as [44], it is nevertheless a requirement that fault-tolerant
systems ought to be able to do without. We address this issue in Section 6.1 where we
develop an efficient procedure that solves the Write-All problem even when the shared
memory is contaminated, i.e., contains arbitrary values.

Atomicity and word size: Thus far, the model we have defined assumes the ability to
perform log N-bit word parallel writes atomically. That is, the model allows: (1) log N-bit
words to be written in unit time, and (2) the adversary to cause failures either
before or after the write cycle of the PRAM, but not during the write cycle. The
algorithms in these models can be modified so that these two restrictions are relaxed.

The new definition of atomicity becomes: (1) log N-size words are written using
log N bit write cycles, and (2) the adversary can cause arbitrary fail-stop errors either
before or after the single bit write cycle of the PRAM, but not during the bit write
cycle. We formally show this in Section 6.2.


2.6  Restartable  Fail-stop  CRCW  PRAM 

We  begin  with  the  PRAM  model  given  in  Section  2.1.  This  PRAM  model  is  first 
augmented  with  the  concept  of  an  update  cycle. 

In  all  our  algorithms  for  the  restartable  fail-stop  PRAM: 

• The PRAM processors execute sequences of instructions that are grouped in update
cycles. Each update cycle consists of reading a small fixed number of shared
memory cells (e.g., ≤ 4), performing some fixed time computation, and writing a
small fixed number of shared memory cells (e.g., ≤ 2).


The  parameters  of  the  update  cycle  (number  of  read  and  write  instructions)  are 
fixed,  but  depend  on  the  instruction  set  of  the  PRAM.  The  values  quoted  (4  and  2)  are 
sufficient  for  our  exposition. 

We next define the model of failures and restarts, and two natural complexity
measures.

2.6.1 Failure and restart model

We  use  the  fail-stop  with  restart  failure  model,  where  time  instances  are  the  PRAM 
clock-ticks: 


1. A failure pattern F (i.e., failures and restarts) is determined by an on-line
adversary that knows everything about the algorithm and that is unknown to the algorithm.

2.  Any  processor  may  fail  at  any  time  during  any  update  cycle,  or  having  failed  it 
may  restart  at  any  time,  provided  that: 

(i)  at  any  time  during  the  computation  at  least  one  processor  is  executing  an 
update  cycle  that  successfully  completes,  and 

(ii)  failures  can  occur  before  or  after  a  write  of  a  single  bit  but  not  during  the 
write,  i.e.,  bit  writes  are  atomic  (see  Remark  2.5  below). 

3. Failures do not affect the shared memory, but the failed processors lose their
private memory. Processors are restarted at their initial state with their PID as
their only knowledge.


The failure and restart patterns are formally defined as follows:


Definition  2.7  A  failure  pattern  F  is  a  set  of  triples  <tag,  PID,  t  >  where  tag  is  either 
failure  indicating  processor  failure,  or  restart  indicating  a  processor  restart,  PID  is 
the  processor  identifier,  and  t  is  the  time  indicating  when  the  processor  stops  or  restarts. 


Remark  2.4  The  failures  and  restarts  we  are  considering  are  different  from  the  errors 
of  omission,  e.g.,  where  processors  may  skip  a  step  but  preserve  their  local  context. 


Remark 2.5 For simplicity of presentation, we assume that the PRAM shared memory
writes of $O(\log \max\{N, P\})$-bit words are atomic. Algorithms using this assumption can
be easily converted to use only single bit atomic writes as we show in Section 6.2.

2.6.2 Measures of efficiency: completed work and overhead ratio

We investigate two natural complexity measures, completed work and overhead
ratio. The completed work measure generalizes the standard Parallel-time × Processors
product and the available processor steps of Definition 2.4. The overhead ratio is an
amortized measure.

Definition 2.8 Consider an algorithm with $P$ initial processors that terminates in
parallel-time $\tau$ after completing its task on some input data $I$ and in the presence of
a failure pattern $F$. If $P_i(I,F) \le P$ is the number of processors completing an update
cycle at time $i$, and $c$ is the time required to complete one update cycle, then we define
$S'(I,F,P)$ as:

$$S'(I,F,P) = c \sum_{i=1}^{\tau} P_i(I,F).$$ □


Definition 2.9 A P-processor PRAM algorithm on any input data $I$ of size $|I| = N$
and in the presence of any pattern $F$ of failures and restarts of size $|F| \le M$:

(i) uses completed work $S' = S'_{N,M,P} = \max_{I,F} \{ S'(I,F,P) \}$, and

(ii) has overhead ratio $\sigma = \sigma_{N,M,P} = \max_{I,F} \left\{ \frac{S'(I,F,P)}{|I| + |F|} \right\}$.

□
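
The two measures can be illustrated for a single execution. This is a hedged sketch rather than part of the formal development: the function names are hypothetical, and it assumes the overhead ratio divides the completed work by |I| + |F|, which is the form suggested by Remarks 2.9 and 2.10. The definition itself takes a maximum over all inputs and failure patterns, which the sketch does not attempt.

```python
from typing import Sequence


def completed_work(completed_per_step: Sequence[int], c: int = 1) -> int:
    """S'(I, F, P) = c * sum_i P_i(I, F) for one execution.

    completed_per_step[i] is the number of processors that complete an
    update cycle at time i; c is the duration of one update cycle.
    """
    return c * sum(completed_per_step)


def overhead_ratio(work: int, input_size: int, pattern_size: int) -> float:
    """Single-execution ratio S'(I, F, P) / (|I| + |F|).

    Assumption: this is the quantity maximized in Definition 2.9(ii).
    """
    return work / (input_size + pattern_size)
```

For instance, an execution on an input of size 8 with one failure event, in which 4, 2, 2 and 1 processors complete update cycles of length c = 3, has completed work 27 and ratio 27 / 9 = 3.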


Remark 2.6 Update cycles are units of accounting. They do not constrain the
instruction set of the PRAM and failures can occur between the instructions of an update
cycle. However, note that in $S'(I,F,P)$ the processors are not charged for the read and
write instructions of update cycles that are not completed.


Remark 2.7 For the fail-stop no-restart execution of restartable algorithms, the
measures $S^*$ and $S'$ are equal asymptotically. When the restarts do not occur, then the
maximum work spent in the incomplete cycles is bounded by $O(P)$, since there can be
no more than $P$ failures. Therefore, for the fail-stop no-restart model, using the work
$S^*$ yields the same results as using the $S'$ measure. The only difference is that $S^*$
accounts for abstract instructions, while $S'$ accounts for update cycles that might contain a
small constant number of instructions.


Remark 2.8 Consider the definition of work $S(I,F,P)$ (Definition 2.3) that accounts
for the incomplete update cycles. Clearly $S(I,F,P) \le S'(I,F,P) + c|F|$. Thus, using
$S$ does affect asymptotically the measure of work (when $|F|$ is very large), but it does
not asymptotically affect $\sigma$ as given in Definition 2.9(ii).

Remark 2.9 One might also generalize the overhead ratio in terms of
$\frac{S'(I,F,P)}{T(I) + |F|}$, where $T(I)$ is the time complexity of the best sequential
solution known to date for the particular problem at hand. For the purposes of this
exposition, it is sufficient to express $\sigma$ in terms of the ratio
$\frac{S'(I,F,P)}{|I| + |F|}$. This is because for the Write-All problem (by itself and
as used in the general simulation) $T(I) = \Theta(|I|)$.

Remark 2.10 Another way to generalize the overhead ratio is in terms of
$\frac{S'(I,F,P)}{T_P(I) \cdot P + |F|}$, where $T_P(I)$ is the parallel time complexity of the
best P-processor solution known to date for the particular problem at hand. Again, for
the purposes of this exposition, it is sufficient to express $\sigma$ in terms of the ratio
$\frac{S'(I,F,P)}{|I| + |F|}$. This is because for the Write-All problem (by itself and as
used in the general simulation) $T_P(I) \cdot P = \Theta(|I|)$.

2.6.3  Discussion  of  the  technical  choices  made 

Work vs. overhead ratio: When dealing with arbitrary processor failures and
restarts, the completed work measure $S'$ depends on the size $N$ of the input $I$, the
number of processors $P$, and the size of failure pattern $F$. The ultimate performance goal
for a parallel fault-tolerant algorithm is to be able to perform the required computation
at a work cost as close as possible to the work performed by the best sequential algorithm
known. Unfortunately, this goal is not attainable when an adversary succeeds in causing
too many processor failures during a computation.

Example 2.3 Work subject to a large number of recoveries: Consider a Write-All
solution, where it takes a processor one instruction to recover from a failure. If an
adversary inflicts a failure pattern $F$ with the number of failures/restarts
$|F| = \Omega(N^{1+\epsilon})$ for $\epsilon > 0$, then the completed work will be
$\Omega(N^{1+\epsilon})$, and thus already non-optimal and potentially large, regardless
of how efficient the algorithm is otherwise. Yet the algorithm may be extremely efficient,
since it takes only one instruction to handle a failure. □

This illustrates the need for a measure of efficiency that is sensitive to both the size
of the input $N$, and the number of failures and restarts $M = |F|$. When $M = O(P)$
as in the case of the stop failures without restarts, $S'$ properly describes the algorithm
efficiency, and $\sigma = O(S'_{N,M,P}/N)$. However, when $F$ can be large relative to $N$
and $P$ (as is the case when restarts are allowed) $\sigma$ better reflects the efficiency of a
fault-tolerant algorithm.
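
The contrast between the two measures in Example 2.3 can be made concrete with a short worked bound. It rests on simplifying assumptions that are not stated in the text: the pattern contains $R = \Theta(|F|)$ restarts (necessarily $R \le |F|/2$, since each restart follows a failure), each restart is followed by exactly one completed update cycle spent on recovery, and the overhead ratio divides completed work by $|I| + |F|$.

```latex
\[
S'(I,F,P) \;\ge\; c\,R \;=\; \Omega\!\left(N^{1+\epsilon}\right),
\qquad\text{whereas}\qquad
\frac{c\,R}{|I| + |F|} \;\le\; \frac{c\,|F|/2}{|F|} \;=\; \frac{c}{2}.
\]
```

Under these assumptions the recovery cycles alone make the completed work superlinear in $N$, yet they add at most a constant to the overhead ratio, which is why the ratio remains informative when the adversary uses many restarts.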

Recall from Remark 2.8 that $\sigma$ is insensitive to the choice of $S$ or $S'$ (and to using
update cycles) as a measure of work. However, update cycles are necessary for the
following reasons.

Update  cycles  and  termination:  Our  failure  model  requires  that  at  any  time,  at  least 
one  processor  is  executing  an  update  cycle  that  completes.  (This  condition  subsumes 
the  condition  of  non-restartable  fail-stop  model  that  one  processor  does  not  fail  during 
the  computation).  This  requirement  is  formulated  in  terms  of  update  cycles  and  assures 
that  some  progress  is  made.  Without  it,  the  algorithms  may  not  terminate,  and  when 
they  do  terminate  the  work  may  be  unbounded.  Since  the  processors  lose  their  context 
after  a  failure,  they  have  to  read  something  to  regain  it.  Without  at  least  one  update 
cycle  completing,  the  adversary  can  force  the  PRAM  to  thrash  by  allowing  only  these 
reads  to  be  performed.  Similar  concerns  are  discussed  in  [90]. 

Update cycles as a unit of accounting: In our definition of completed work we only
count completed update cycles. Even when the progress and termination of a computation
are assured (by always completely executing at least one update cycle), if the processors
are charged for incomplete update cycles, then the work $S$ (of Remark 2.8) of any algorithm
that simulates a single N-processor PRAM step is at least $\Omega(P \cdot N)$. The reason for this
quadratic behavior is the following simple and rather uninteresting thrashing adversary.

Example 2.4 Thrashing adversary: Let ALG be any algorithm that solves the Write-All
problem under the arbitrary failure/restart model. Consider the standard PRAM
read/compute/write cycles (if processors begin writing without reading, a simple
modification of the following argument leads to the same result). A thrashing adversary
allows all processors to perform the read and compute instructions, then it fails all but
one processor for the write operation. The adversary then restarts all failed processors.
Since one write operation is performed per read/compute/write cycle, $N$ cycles will
be required to initialize $N$ array elements. Each of the $P$ processors performs $\Theta(N)$
instructions, which results in work of $\Theta(P \cdot N)$. □
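
The accounting in Example 2.4 can be replayed numerically. The sketch below is an illustration under stated assumptions, not an algorithm from this thesis: the function name `thrash` is hypothetical, each cycle is modeled as one read, one compute and one write instruction, and the single surviving processor is assumed to write the first unwritten cell.

```python
from typing import Tuple


def thrash(N: int, P: int) -> Tuple[int, int, int]:
    """Replay the thrashing adversary on a Write-All array of size N.

    Returns (cycles, charged_all, charged_completed):
      cycles            - read/compute/write rounds until the array is all 1s
      charged_all       - instructions charged when incomplete cycles count
      charged_completed - completed cycles only (one unit per completed cycle)
    """
    x = [0] * N
    cycles = charged_all = charged_completed = 0
    while 0 in x:
        # Every processor performs the read and the compute instruction.
        charged_all += 2 * P
        # The adversary fails all but one processor before the write;
        # the survivor writes one cell (assumed: the first unwritten one).
        x[x.index(0)] = 1
        charged_all += 1
        charged_completed += 1
        cycles += 1
        # All failed processors are restarted before the next round.
    return cycles, charged_all, charged_completed
```

For N = P = 8 this gives 8 rounds, 136 instructions when incomplete cycles are charged (N(2P + 1), quadratic for P = N), and only 8 units when only completed cycles are charged.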

By charging the processors only for the completed fixed size update cycles, and
not for partially completed cycles, we do not charge for thrashing adversaries. It is
interesting that this change in cost measure allows sub-quadratic solutions.


Chapter 3


Write- All  Algorithms 


Demonstrating the existence of efficient algorithms for the Write-All problem
given in Definition 2.1 is essential for the general simulation we develop and
analyze in Chapter 5. In this chapter we present and analyze several algorithms for the
Write-All problem using three processor allocation paradigms.

In this chapter and in the detailed description of the algorithms in Appendices A
and B we assume that N is a power of 2. Nonpowers of 2 can be handled using
conventional padding techniques. All logarithms are to the base 2, and div stands for
integer division with truncation.


3.1  Processor  Allocation  Paradigms 


Processor allocation is often the key problem on the way to achieving efficient solutions.
Processors need to be allocated so that the relative processor loads are balanced
according to appropriate criteria. We present three processor allocation paradigms used
in constructing efficient solutions for the Write-All problem.

In the overview of the processor allocation paradigms and algorithms within each
paradigm we give the complexity results in terms of N, the size of the Write-All array.


Global allocation paradigm

The allocation of processors in the global allocation paradigm is performed using the
knowledge of the global state of the computation. The processors compute and reduce
the information that is in turn used to synchronize and allocate processors. We present
two deterministic algorithms that use the global allocation paradigm:

Algorithm W: this is a deterministic fail-stop no-restart algorithm for which
$S^* = O(N \log^2 N / \log\log N)$; this algorithm has a range of processors,
$P \le N/\log^2 N$, for which the work of the algorithm is optimal for any pattern of failures.

Algorithm V: this is an algorithm that can be used with both models; in the
non-restartable model it has $S^* = O(N \log^2 N)$ and a range of optimality similar to that of
algorithm W, and for the restartable fail-stop model it has $S' = O(N \log^2 N + M \log N)$,
where $M$ is the size of the failure pattern encountered during the execution.


Local  allocation  paradigm 

Here  processors  make  allocation  decisions  based  on  the  information  that  is  local  to  the 
processors,  or  that  is  immediately  available  in  constant  number  of  memory  accesses 
from  the  shared  data  structures.  No  global  synchronization  is  necessary,  and  therefore 
local  allocation  algorithms  can  also  be  used  with  asynchronous  systems  (as  shown  by 
Buss  et  al.  in  [27]).  We  present  and  analyze  a  deterministic  algorithm,  and  we  present 
(without  analysis)  two  randomized  algorithms  based  on  the  deterministic  algorithm: 

Algorithm X: this is a deterministic algorithm that can be used in both the
fail-stop and restartable models. In the fail-stop model we show and analyze a particular
failure scenario (this scenario was refined by Lopez-Ortiz [68] to exhibit the known
worst fail-stop work for algorithm X). For the restartable model we provide a complete
analysis and show that for any pattern of failures and restarts the algorithm
has subquadratic work $O(N^{1.59})$ and efficient overhead ratio $\sigma = O(\log^2 N)$ (when
interleaved with algorithm V).

Algorithms $X_{coin}$ and $X_{die}$: these are two variations of algorithm X using coin
tossing and die casting respectively. The analysis of these algorithms is an open question.


Hashed allocation paradigm

Here  processors  are  allocated  in  a  hashed  fashion,  either  according  to  a  randomized 
scheme  or  using  a  deterministic  scheme  that  approximates  a  particular  randomized 
scheme.  Hashed  allocation  algorithms  can  be  used  in  both  restartable  and  non-restartable 
failure  models. 

Algorithm Y: this is an efficient determinization of a randomized algorithm that
was defined by Anderson and Woll in [8]. We present this algorithm without analysis.
Some experimental work suggests that the algorithm is very efficient. The
analysis of algorithm Y is stated as an open problem that shows an interesting linkage
between group theory, combinatorics and multi-processor scheduling.


3.2  Global  Allocation  Paradigm 

This  section  consists  of  three  parts.  In  the  first  two  we  present  two  algorithms  using  the 
global  allocation  paradigm:  algorithm  W  and  algorithm  V.  In  the  final  part  we  define 
the  processor  allocation  monotonicity  property  and  show  that  these  two  algorithms 
have this property. This property will enable us to develop fault-tolerant simulations
for PRIORITY PRAMs in Chapter 5.

3.2.1  Algorithm  W 

We now define and analyze a robust parallel algorithm for the Write-All problem in the
fail-stop no-restart model. We call it algorithm W. This solution illustrates the notion
of robustness in the fail-stop model. The original Parallel-time × Processors product,
$N$, is increased by at most a $c \log^2 N / \log\log N$ multiplicative factor, for any dynamic
pattern of failures with at least one surviving processor. Note that we have no knowledge
of how many, when, or which processors will fail.
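
As a quick check of this bound against Definition 2.6, the following worked inequality may help. It assumes $T(N) = N$ as the sequential time for Write-All and $N \ge 4$, so that $\log\log N \ge 1$; the constant $c$ is the one in the multiplicative factor quoted above.

```latex
\[
S^* \;\le\; c\,\frac{N \log^2 N}{\log\log N}
    \;\le\; c\,N \log^2 N
    \;=\; c\,T(N)\,\log^{c'} N
\qquad\text{with } T(N) = N,\; c' = 2 .
\]
```

So the stated factor already meets the robustness requirement with exponent $c' = 2$, with room to spare from the $\log\log N$ denominator.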

Algorithm W's solution for the Write-All problem is based on a parallel loop through
(1) a failure detecting phase, (2) a load rescheduling phase, (3) a work phase where
assignments (x[i] := 1) are performed, and (4) a phase that estimates the work remaining
and controls the parallel loop. The entire algorithm is moderately involved, but fairly
modular. Phases 1 and 4 involve bottom up traversal of two different heaps and phase
2 involves a top down traversal of these heaps. Algorithm W uses the ability of the
PRAM to atomically write words of $O(\log N)$ bits. However this is only for convenience
of presentation, and in Section 6 we remove this assumption.

This solution is simple enough to capture certain engineering intuitions (e.g., the
rescheduling involves divide-and-conquer) and to be easily implementable (e.g., we
include a detailed description of the code in Appendix A). Proving robustness is the subject
of Section 3.2. The phases of the algorithm are such that reasoning about the failure
patterns involves few cases and the algorithm analysis uses recurrences and inequalities.
By exploiting parallel slackness as advocated by Valiant [98], and using a slightly
smaller number of processors ($1 \le P \le N/(\log^2 N - \log\log N \log N)$, where $N$ is the
size of the input array) we show that Write-All can be solved optimally with $S^* = O(N)$.

For simplicity of presentation, in the rest of this section we assume that the initial
number of processors P is N, where N is the input size. Our results immediately extend
to any P in the range 1, ..., N by assuming that the algorithm starts with N processors,
and that N - P processors fail prior to the first step of the algorithm.

Algorithm  W  definition 

Algorithm W is a four phase iterative algorithm. It uses full binary trees to (1)
enumerate surviving processors, (2) allocate processors, (3) perform work (x[i] := 1), and
(4) measure progress.

Input: Shared array x[1..N]; x[i] = 0 for 1 ≤ i ≤ N.

Output: Shared array x[1..N]; x[i] = 1 for 1 ≤ i ≤ N.

Data-structures: We use four full binary trees, each of size 2N - 1, stored as heaps
in shared memory. By heap h[1..2N - 1] we mean that array h codes a full binary tree
structure by using h[i] (i = 1, ..., N - 1) as an internal tree node with the left child
h[2i] and the right child h[2i + 1].

The heaps are c[1..2N - 1] (for processor counting and allocation), cs[1..2N - 1]
(for keeping step numbers), d[1..2N - 1] (for progress counting) and a[1..2N - 1] (for
top-down auxiliary accounting). They are initially 0.
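
The heap layout just described amounts to simple index arithmetic. The following is a minimal sketch and not the code of Appendix A; the helper names (`left`, `right`, `parent`, `leaf`, `path_to_root`) are hypothetical, and N is assumed to be a power of 2.

```python
from typing import List


def left(i: int) -> int:
    """Index of the left child of internal node h[i]."""
    return 2 * i


def right(i: int) -> int:
    """Index of the right child of internal node h[i]."""
    return 2 * i + 1


def parent(i: int) -> int:
    """Index of the parent of node h[i], for i > 1."""
    return i // 2


def leaf(i: int, N: int) -> int:
    """Leaf associated with array element x[i] or processor PID = i."""
    return i + N - 1


def path_to_root(i: int, N: int) -> List[int]:
    """Nodes visited by a bottom-up traversal from the leaf of i to the root."""
    node = leaf(i, N)
    path = [node]
    while node > 1:
        node = parent(node)
        path.append(node)
    return path
```

For N = 4 the leaves occupy indices 4..7, and the traversal starting from element 3 visits nodes 6, 3 and 1, that is, log N + 1 nodes.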

The input is in shared array x[1..N], where the N elements of this array are
associated with the leaves of the heaps d and a. Element x[i] is associated with d[i + N - 1]
and a[i + N - 1], where 1 ≤ i ≤ N. Similarly processors are initially associated with
the leaves of the heap c, such that processor PID is associated with c[PID + N - 1].

forall processors PID = 1..N parbegin
    Phase W3: Visit the leaves based on PID to perform work on the input data.
    Phase W4: Traverse the d heap bottom up to measure progress.
    while the root of the d heap is not N do
        Phase W1: Traverse the c, cs heaps bottom up to enumerate processors.
        Phase W2: Traverse the d, a, c heaps top down to reschedule work.
        Phase W3: Perform rescheduled work on the input data.
        Phase W4: Traverse the d heap bottom up to measure progress.
    od
parend

Figure 3.1: A high level view of algorithm W
Each  processor  uses  some  constant  amount  of  local  memory.  For  example,  this  local 
memory  may  be  used  to  perform  some  simple  arithmetic  computations.  Important 
local  variables  are  PID,  containing  the  initial  processor  identifier,  and  pn,  containing  a 
dynamically  changing  processor  number.  Note  that  PID’s  do  not  change  but  pn’s  do. 

Thus, the overall memory used is $O(N + P)$ and the data-structures are very simple.

Control-flow: Due to the omniscience of the adversary, we employ an oblivious
iterative approach in the sense that the pool of the available processors is treated uniformly
and is assigned evenly to the tasks that need to be done. The basic idea of the loop is:
(a) For failure detection use bottom up, fast parallel summation to estimate the
surviving processors and to estimate the progress they have made, (b) For load rescheduling
use a top down, divide-and-conquer strategy based on the estimate of progress made.
This idea is realized as follows.

The algorithm consists of the parallel loop given in Figure 3.1. This loop is performed,
in a synchronous way, by all processors that have not stopped. It consists of four phases
of steps, and the first time only part of it is executed (phases W3 and W4). Of course,
processors can fail-stop at any time during the algorithm. We next proceed with a high
level description of the phases, and then provide additional details with examples.

Phase W1 - the failure detection/processor enumeration phase: In this phase all
processors traverse a full binary tree used for processor counting starting with the
leaves associated with processor identifiers (PIDs) and finishing at the root. A
version of the standard parallel addition algorithm is used for counting.

Phase  W2  -  the  processor  allocation  phase:  Here,  the  processors  begin  at  the  root  of 
the  full  binary  tree  that  represents  the  progress  of  the  algorithm,  and  traverse  it 
starting  with  the  root  and  finishing  at  the  leaves  associated  with  the  unfinished 
work.  The  processors  are  allocated  in  a  divide-and-conquer  fashion  according  to 
the  hierarchy  of  the  progress  tree. 

Phase  W3  -  the  work  phase:  The  processors  now  perform  work  they  find  at  the  leaves 
they  reached  in  phase  W2. 

Phase  W4  -  the  progress  measurement  phase:  The  processors  begin  at  the  leaves  of 
the  progress  tree  where  they  ended  phase  W3  and  traverse  it  to  the  root  to  estimate 
the  progress  of  the  algorithm.  A  version  of  the  standard  parallel  addition  algorithm 
is  used  to  count  the  number  of  leaves  where  the  work  of  phase  W3  was  successfully 
done. 

Algorithm  W  technical  details: 

In phase W1 each processor PID traverses heaps c and cs bottom up from the
location PID + N - 1. The $O(\log N)$ path of this traversal is the same (static) for all the
loop iterations. As processors perform this traversal they calculate an overestimate of
the surviving processors. This is done using a standard $O(\log N)$ parallel-time version
of a CRCW summation algorithm. Heap c holds the sums and heap cs the timestamps
(or step numbers) for the current loop iteration. This allows reusing c without having
to initialize it each time. Also, during this traversal surviving processors calculate new
processor numbers pn for themselves, based on the same sums. Detailed code for this
procedure is given in Appendix A.2.

Each processor PID starts by writing a 1 in the leaf c[PID + N - 1] of the tree c. If a
processor fails before it writes 1 then its action will not contribute to the overall count.
If a processor fails after it writes 1 then this number can still contribute to the overall
sum if one or more processors were active at a sibling tree node and remained active as
they moved to the ancestor tree node. The same observation applies to counts written
subsequently at internal nodes, which are the sums of the counts of the children nodes
in tree c.

Example 3.1 Processor enumeration: Consider phase W1 for N = 4. There are 4 processors
with PIDs 1, 2, 3, and 4, and the counting tree is represented as the heap c[1..7].
If processor 1 failed prior to the start of phase W1, processor 3 failed right after
writing 1 into its leaf c[6], and processor 4 failed after calculating c[3] = 2 as the
sum of its (c[7] = 1) and processor 3's (c[6] = 1) contributions, then after the
completion of the phase the heap holds c[1] = 3, c[2] = 1, c[3] = 2, and
c[4..7] = 0, 1, 1, 1. Observe that the root value c[1] = 3 yet the actual number of
active processors is 1. □
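
The counting pass of Example 3.1 can be replayed with a small sequential simulation. This is only an illustrative sketch, not the code of Appendix A.2: the function name `phase_w1_counts` and the way failures are described (a map from PID to the number of writes completed before fail-stopping) are assumptions made for the example, and the timestamp heap cs is left out by starting from an all-zero heap.

```python
def phase_w1_counts(n, writes_before_failure):
    """Simulate one synchronous bottom-up counting pass over heap c[1..2n-1].

    n must be a power of two. writes_before_failure maps a PID to the number
    of writes it completes before fail-stopping; PIDs absent from the map
    never fail. Timestamps (heap cs) are omitted: c starts out all zero.
    """
    c = [0] * (2 * n)                      # index 0 is unused
    node = {pid: pid + n - 1 for pid in range(1, n + 1)}
    done = {pid: 0 for pid in node}        # writes completed so far

    def alive(pid):
        limit = writes_before_failure.get(pid)
        return limit is None or done[pid] < limit

    # Step 0: every live processor writes 1 into its own leaf.
    for pid in node:
        if alive(pid):
            c[node[pid]] = 1
            done[pid] += 1

    # One step per tree level: read both children, then write the parent.
    for _ in range(n.bit_length() - 1):
        pending = {}
        for pid in node:
            if alive(pid):
                parent = node[pid] // 2
                pending[pid] = (parent, c[2 * parent] + c[2 * parent + 1])
        for pid, (parent, total) in pending.items():
            c[parent] = total
            node[pid] = parent
            done[pid] += 1
    return c


# Example 3.1: PID 1 fails before the phase, PID 3 after its leaf write,
# PID 4 after writing c[3]; PID 2 survives.
c = phase_w1_counts(4, {1: 0, 3: 1, 4: 2})
print(c[1:])  # [3, 1, 2, 0, 1, 1, 1]
```

The root ends up at 3 although only processor 2 is still running, which is the overestimate described in the example.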

It is easy to show that phase W1 will always compute in c[1] an overestimate of the
number of processors that are surviving at the time of its completion (see Lemma 3.1).

We also need to enumerate the surviving processors. This is accomplished by each
processor assuming that it is the only one, and then adding the number of the surviving
processors it estimates to its left. This enumeration creates the dynamic processor
number pn.
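
One way to picture the enumeration is to start from 1 and add the count stored at the left sibling every time the climb passes through a right child. The sketch below assumes this reading; the helper name `processor_number` is hypothetical, and it reads a finished count heap, whereas the algorithm gathers the same sums during its traversal.

```python
def processor_number(c, n, pid):
    """Return pn for `pid`: one plus the counts found to its left in heap c.

    c is the count heap c[1..2n-1] (index 0 unused) of a tree with n leaves.
    """
    node = pid + n - 1          # leaf of this processor
    pn = 1                      # "assume it is the only one"
    while node > 1:
        if node % 2 == 1:       # right child: add the left sibling's count
            pn += c[node - 1]
        node //= 2
    return pn


# Failure-free heap for N = 4: the numbering coincides with the PIDs.
full = [0, 4, 2, 2, 1, 1, 1, 1]
print([processor_number(full, 4, pid) for pid in (1, 2, 3, 4)])  # [1, 2, 3, 4]

# Heap of Example 3.1: the only survivor, PID 2, becomes processor number 1.
c = [0, 3, 1, 2, 0, 1, 1, 1]
print(processor_number(c, 4, 2))  # 1
```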

Finally, in phase W1 we must be able to reuse our heap several times. This presents a
problem. For example, if a processor had written 1 into its heap leaf and then failed, then
the value 1 will remain there for the duration of the computation, thus preventing us
from computing monotonically tighter estimates of the number of surviving processors.
This is corrected by associating a step number with each node of the count heap c and
storing it in heap cs, thus time stamping valid data. The count step is initially zero,
and during each successive loop iteration it gets incremented by each surviving processor.
Failed processors will not increment their step numbers, thus enabling the surviving
processors to detect counts that are out-of-date and treat them as zeroes. We need not
worry about time stamping overflow, since we have words of O(log N) bits and in the
worst case the loop iterates N times (see Lemma 3.3).
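
The effect of the timestamps can be illustrated on a three-node heap. The helpers `stamped_read` and `stamped_write` are hypothetical names introduced for this sketch; they only show the idea that a count is trusted when its stamp equals the current step number.

```python
def stamped_read(c, cs, node, step):
    """Return c[node] if it was written during iteration `step`, else 0."""
    return c[node] if cs[node] == step else 0


def stamped_write(c, cs, node, value, step):
    """Store a count together with the iteration that produced it."""
    c[node] = value
    cs[node] = step


# Root 1 with leaves 2 and 3; index 0 is unused.
c, cs = [0, 0, 0, 0], [0, 0, 0, 0]

# Iteration 1: both processors write their leaves, then the left one fails.
stamped_write(c, cs, 2, 1, step=1)
stamped_write(c, cs, 3, 1, step=1)

# Iteration 2: only the right processor writes again and climbs to the root.
stamped_write(c, cs, 3, 1, step=2)
root = stamped_read(c, cs, 2, step=2) + stamped_read(c, cs, 3, step=2)
stamped_write(c, cs, 1, root, step=2)
print(c[1])  # 1: the stale count still stored in c[2] is ignored
```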

In phase W2 all surviving processors start at the root of the progress tree d. In d[i]
there is an underestimate of the work already performed in the subtree defined by i.
Now the processors traverse d top down and get rescheduled dynamically according to
the work remaining to be done in the subtrees of i.

It is essential to balance the work loads of the surviving processors. In the next
section, we formally show that the algorithm meets the goal of balancing (Lemma 3.2).
Although the divide-and-conquer idea based on d is sound, some care has to be put into
its implementation.

Heap c after the completion of phase W1 in Example 3.1:

    c[1]:         3
    c[2,3]:       1  2
    c[4,5,6,7]:   0  1  1  1
    PID:          1  2  3  4

In the remaining discussion of phase W2 we explain our implementation, which is
based on an auxiliary progress tree a. The values in a are defined from the values in d;
all of them are defined given d, although only part of a is actually computed. The
important points are that (i) a represents the progress made and fully recorded from
leaves to the root, and (ii) the value of each a[i] is defined based only on the values of
d seen along the unique path from the root to the node i.

At each internal node i, the processors are divided between the left and right subtrees
in proportion to the leaves that either have not been visited or whose visitation was not
fully recorded in d. This is accomplished by computing a[2i], a[2i + 1] and using these
values instead of d[2i], d[2i + 1] in order to discard partially recorded progress (caused
by failures and recorded by the processors in the dynamic bottom up traversal of d
only part way to the root). We detect partially recorded progress in d when the value of
an internal node in d is less than the sum of the values of its two descendants. Thus,
at i, after computing the values a[2i], a[2i + 1], the scheduling of work is done using
divide-and-conquer according to the values N - a[2i] and N - a[2i + 1].

Formally, the nonnegative integer values in a are constrained top down as follows:

The root value is $a[1] = d[1]$. For the children of an interior node i ($1 \le i \le N - 1$)
we have $a[2i] \le d[2i]$, $a[2i+1] \le d[2i+1]$, and $a[2i] + a[2i+1] = a[i]$.

These constraints do not uniquely define a. However, we realize a unique definition
by making a[2i] and a[2i + 1] proportional (up to round-off) to the values d[2i] and
d[2i + 1]. Thus, our dynamic top-down traversal (given in detail in Appendix A.4)
implements one way of uniquely defining the values of a satisfying these constraints.

The constraints on the values of a assure that (i) there are exactly d[1] = a[1]
leaves whose d and a values are 1 (such leaves are called accounted, and no processor
will reach these leaves), and (ii) the processors reach leaves with the a values of 0
(such leaves are called unaccounted). Also see Example 3.2 below for additional intuition
on a.
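
The constraints can be made concrete with a short sketch that derives a whole auxiliary heap from a progress heap. The rounding rule in `split_a` is an assumption chosen for this illustration (the thesis defers its own rule to Appendix A.4), and both function names are hypothetical; any rule that respects the three constraints would serve equally well here.

```python
def split_a(a_parent, d_left, d_right):
    """Split a_parent between two children, proportionally to their d values.

    Guarantees a_left <= d_left, a_right <= d_right and
    a_left + a_right == a_parent, provided a_parent <= d_left + d_right.
    """
    total = d_left + d_right
    if total == 0:
        return 0, 0
    a_left = a_parent * d_left // total      # floor of the proportional share
    a_right = a_parent - a_left
    if a_right > d_right:                    # round-off pushed too much right
        a_left += a_right - d_right
        a_right = d_right
    return a_left, a_right


def auxiliary_heap(d, n):
    """Build the whole heap a[1..2n-1] from d[1..2n-1] (index 0 unused)."""
    a = [0] * (2 * n)
    a[1] = d[1]
    for i in range(1, n):                    # interior nodes 1..n-1
        a[2 * i], a[2 * i + 1] = split_a(a[i], d[2 * i], d[2 * i + 1])
    return a


# A progress heap for N = 4 in which leaf 6 was visited but never recorded
# above the leaf level (d[3] = 0 although d[6] = 1).
d = [0, 2, 2, 0, 1, 1, 1, 0]
a = auxiliary_heap(d, 4)
print(a[1:])  # [2, 2, 0, 1, 1, 0, 0]
```

Leaf 6 has d value 1 but a value 0, so it counts as unaccounted and will still receive a processor.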

Remark 3.1 Strictly speaking, the auxiliary progress tree a need not be represented
as a shared heap. Since the values of the a heap are computable from the d heap, it is
sufficient for each processor to have three local scalar variables to represent a node and
its two descendants in a "virtual" heap a. However, it is convenient to use a as defined
above in the proofs of the next section. In any case, a linear amount of storage is used.

In phase W3 all processors are at the leaves reached in phase W2. Each processor
writes 1 in the array element associated with the leaf it has been rescheduled to. Prior
to the start of the first iteration of the loop each processor PID tries to write in location
x[PID]. Phase W3 is where the work of the original non-robust algorithm gets done. It
is contained within procedure Main in Appendix A.1.

In  phase  W4  the  processors  record  the  progress  made  by  traversing  the  d  heap  bottom 
up and using the standard summation method. The O(log N) paths (dynamically)
traversed  by  processors  can  differ  in  each  loop  iteration,  since  processors  start  from  the 
leaves  where  they  were  in  phase  W3.  What  is  computed  each  time  is  an  underestimate 
of  the  progress  made.  No  timestamps  are  needed  here  because  the  progress  recorded 
increases  monotonically.  This  dynamic  bottom  up  traversal  is  given  in  Appendix  A.3. 

Phase W4 is a simple variant of phase W1, except for the fact that the path traversed
bottom up is dynamically determined. One can easily show that the progress recorded in
d[1] by phase W4 increases monotonically and that it underestimates the actual progress (see
Lemma 3.3). This guarantees that the algorithm terminates after at most N iterations,
since $d[1] \ne N$ is the guard that controls the main loop.

The  following  example  illustrates  phase  W4,  and  provides  intuition  for  why  the  heap 
a  is  used  in  phase  W2  and  why  it  is  needed  by  the  proof  framework  presented  in  the 
next  section. 

Example 3.2 Progress estimation: Consider phase W4 for N = 4. There are 4 processors
with PIDs 1, 2, 3, and 4, and the progress tree is represented as the heap d[1..7].
If, during a phase W4 bottom-up traversal of the progress heap d, processor 4 failed
prior to the start of the phase, and processor 3 failed after the first step of the
traversal having written 1 into the leaf d[6], then the d heap will look like this
after the completion of the phase:

    d[1]:         2
    d[2,3]:       2  0
    d[4,5,6,7]:   1  1  1  0
    PID:          1  2  3  4

Let P' = 2 be the number of surviving processors. We see that d[1] = 2 is an
underestimate of the actual number, i.e. 3, of visited leaves. If the d heap is used
directly in phase W2 to allocate processors to the unvisited leaves, then the leaf
associated with d[7] will be allocated all P' surviving processors. On the other hand,
by knowing the (overestimated) number of surviving processors P' and the (underestimated)
number of visited leaves d[1], we would like to prove that the allocation is balanced,
and that no leaf is allocated more than $\lceil P'/(N - d[1]) \rceil = 1$ processors.
We use the heap a in phase W2, where the surviving processors compute a[6] = a[7] = 0,
with each reaching a distinct leaf, thus assuring balanced processor allocation. □
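
The allocation in Example 3.2 can be traced with a small top-down sketch. Several things here are assumptions made for illustration rather than the procedure of Appendix A.4: the function name `allocate`, the rule that the left subtree receives its proportional share rounded up, and the reading that a subtree with m leaves holds m - a[node] unaccounted leaves. The a values other than a[6] = a[7] = 0 are not stated in the example; they are the only ones that satisfy the constraints on a for the d heap shown there.

```python
def allocate(a, n, pn, r):
    """Return the leaf reached by the processor ranked pn (1-based) out of r.

    a is the auxiliary heap a[1..2n-1] (index 0 unused), n a power of two.
    A subtree with m leaves is assumed to hold m - a[node] unaccounted
    leaves, and the whole tree must contain at least one of them.
    """
    node, leaves = 1, n
    rank, procs = pn - 1, r            # 0-based rank among procs processors
    while node < n:                    # descend until a leaf is reached
        leaves //= 2
        u_left = leaves - a[2 * node]
        u_right = leaves - a[2 * node + 1]
        total = u_left + u_right
        # Share of the processors sent left, rounded up (integer ceiling).
        p_left = (procs * u_left + total - 1) // total
        if rank < p_left:
            node, procs = 2 * node, p_left
        else:
            node, rank, procs = 2 * node + 1, rank - p_left, procs - p_left
    return node


# Auxiliary heap for Example 3.2: a[1] = 2, a[2] = 2, a[3] = 0,
# a[4] = a[5] = 1, a[6] = a[7] = 0; two processors are estimated to survive.
a = [0, 2, 2, 0, 1, 1, 0, 0]
print([allocate(a, 4, pn, 2) for pn in (1, 2)])  # [6, 7]
```

Each of the two survivors lands on its own unaccounted leaf, matching the balanced outcome described in the example.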

Analysis  of  algorithm  W 

We  now  outline  the  proof  of  robustness  for  algorithm  W.  Lemma  3.1  shows  that  in  each 
loop  iteration,  the  algorithm  computes  (over)estimates  of  the  remaining  processors.  In 
Lemma  3.2  we  prove  that  processors  are  only  allocated  to  the  unaccounted  leaves,  and 
that  all  such  leaves  are  allocated  a  balanced  number  of  processors.  Lemma  3.3  assures 
monotonic  progress  of  the  computation,  and  thus  its  termination.  In  Lemma  3.4  we 
develop  an  upper  bound  on  the  work  performed  by  the  processors  prior  to  the  algorithm 
termination.  Lemmas  3.1  and  3.3  are  proved  using  simple  inductions  on  the  structure 
of  the  heaps  used  by  the  algorithm.  Lemma  3.2  is  shown  by  using  an  invariant  for  the 
algorithm  of  phase  W2.  Lemma  3.4  is  the  central  lemma  of  this  section  and  its  proof 
consists  of  a  relatively  involved  induction  on  the  size  of  the  estimated  work  remaining 
at  some  step  of  the  algorithm. 

These  lemmas  are  used  to  show  the  main  Theorem  3.5,  and,  by  exploiting  parallel 
slackness,  we  obtain  the  optimality  result  Theorem  3.7. 

We first introduce some terminology. Let us consider the i-th iteration of the loop
($1 \le i \le N$). Note that the first iteration consists only of phases W3 and W4. Define:
(1) $U_i$ to be the estimated remaining work, the value of $N - d[1]$ right before the iteration
starts, i.e. right after phase W4 of the previous iteration ($U_1$ is $N$); (2) $P_i$ to be the real
number of surviving processors, right before the iteration starts, i.e. right after phase
W4 of the previous iteration ($P_1$ is $P$); (3) $R_i$ to be the estimated number of surviving
processors, that is the value of c[1] right after phase W1 of the i-th iteration ($R_1$ is $P$).
The following is shown by straightforward induction on tree c.

Lemma 3.1 In algorithm W, for all loop-iterations i we have: $P_i \ge R_i \ge P_{i+1}$, as long
as at least one processor survives.

Proof: The basis $P = P_1 = R_1 \ge P_2$ is obvious. $P_i$ is the number of processors active
prior to the first PRAM instruction of the phase W1 algorithm for static bottom-up
traversal. By the definition of the model, we immediately have $P_i \ge P_{i+1}$. We will
show that $P_i \ge R_i \ge P_{i+1}$ by induction on the structure of the c tree after the
completion of the phase W1 static bottom-up traversal. For simplicity we will treat the
values of the c tree with incorrect cs version numbers as virtual zeroes, and not involve
the cs tree further in this proof. The proof will involve two inductions: one to show the
first part of the inequality, and one for the second part.

(1) Inequality $P_i \ge R_i$: Let $s(t)$ denote the number of processors that initiated phase
W1 of the algorithm in the subtree of the c tree rooted at node t. Clearly, we have that
$s(1) = P_i$.

Basis: For all subtrees of height 0 rooted at t, $c[t] \le s(t)$, because some processors
may have stop-failed after the initiation of phase W1, but before the initialization of
the leaves of the c tree.

Inductive hypothesis: assume that for all subtrees of height h rooted at nodes t, we
have $c[t] \le s(t)$.

Inductive step: consider nodes t of height h + 1. By the inductive hypothesis:
$c[2t] \le s(2t)$ and $c[2t+1] \le s(2t+1)$. If any processor reached a node t, then
$c[t] = c[2t] + c[2t+1] \le s(2t) + s(2t+1) = s(t)$. If no processors reached the node t, then
$c[t] = 0 \le s(t)$.

The induction stops at t = 1 where $R_i = c[1] \le s(1) = P_i$, and so $P_i \ge R_i$.

(2) Inequality $R_i \ge P_{i+1}$: This can be shown using a similar induction, but instead of $s(t)$
we define $r(t)$ to be the number of processors that initiated phase W1 in the subtree
of the node t and that completed the phase W1 traversal. $r(1)$ is an upper bound for
$P_{i+1}$, and the induction will show that $r(1) \le c[1]$. □

In the dividing done during the dynamic top down traversal in W, we will allocate
processors to tasks that either have not been completed, or have been completed, but
not yet accounted for at the root d[1]. Recall that a leaf of d is accounted if it has value 1
and if the corresponding defined value in the leaf of heap a is also 1 (there are exactly d[1]
accounted leaves). In Algorithm W, the processors get allocated in a balanced fashion
to the unaccounted leaves, i.e. the leaves whose associated (defined and) computed value
in heap a is 0. The next lemma shows that the processor allocation to the unaccounted
leaves is balanced. It involves a detailed but straightforward assertional proof. Below
we give the lemma with a proof sketch; its full proof is found in Appendix A.6.

Lemma 3.2 In phase W2 of each loop-iteration i of algorithm W: (1) processors are
only allocated to unaccounted leaves, and (2) no leaf is allocated more than $\lceil R_i/U_i \rceil$
processors.

Proof sketch: The lemma is shown by proving an invariant for the phase W2
algorithm. For each active processor, the main assertions of the invariant are that during
the top down traversal, at each node j of the progress tree d and the auxiliary progress
tree a: (1) a[j] is strictly less than the number of leaves in the subtree of node j,
and $a[j] \le d[j]$, and (2) the maximum number of active processors allocated to the
progress subtree of node j is equal (up to a round-off) to $R_i/U_i$ times the number of
unaccounted leaves in that subtree. When the surviving processors reach the leaves,
it follows from the invariant that a[j] = 0, i.e., the leaf is unaccounted, and that the
number of processors at that leaf is no more than $\lceil R_i/U_i \rceil$. □

The  following  lemma  shows  that  for  each  loop-iteration,  the  number  of  unvisited 
leaves  is  decreasing  monotonically,  thus  assuring  termination  of  the  main  loop  after 
at  most  N  iterations.  The  worst  case  of  exactly  N  iterations  corresponds  to  a  single 
processor  surviving  at  the  outset  of  the  algorithm. 

Lemma 3.3 In algorithm W, for all loop-iterations i we have: $U_i > U_{i+1}$, as long as
at least one processor survives.

Proof: To prove this, we define $a_i[1..2N-1]$ and $d_i[1..2N-1]$ to be the values of trees a
and d after the completion of iteration i. $d_0[1..2N-1]$ are the initial zero values. We first
show that if an iteration i + 1 is started with tree d satisfying $d_i[t] \le d_i[2t] + d_i[2t+1]$
($1 \le t < N$), then after the termination of loop-iteration i + 1, tree d will satisfy
$d_{i+1}[t] \le d_{i+1}[2t] + d_{i+1}[2t+1]$ ($1 \le t < N$), and along a path completely traversed
from leaf to root by a processor in phase W4: $a_i[t] < d_{i+1}[t]$ ($1 \le t < 2N$, t along the
path traversed).

This can be shown using straightforward induction on the structure of the tree d
and the loop-iteration number i. From this, since $U_i = N - d_i[1] = N - a_i[1]$ and
$U_{i+1} = N - d_{i+1}[1]$, we have $N - U_i < N - U_{i+1}$, which leads to the desired result. □

We now come to the main lemma. We will treat the three log N time tree traversals
performed by a single processor during each phase of the algorithm as a single block-step
of cost O(log N). We will charge each processor for each such block step, regardless of
whether the processor actually completes the traversals or whether it fail-stops
somewhere in-between. This coarseness will not distort our results; since we can have at
most P processor failures it amounts to a one time overcharge of O(P log N). Let us
take a snapshot of the algorithm after completion of several loop-iterations. We are
right before loop-iteration i. $V_i$ stands for the total number of block-steps performed
by the processors in trying to complete all remaining work (at most $U_i$).


Lemma 3.4 For any failure pattern with at least one surviving processor, and starting
at each loop-iteration i, algorithm W completes all remaining work. Its total number
of block-steps $V_i$ is less than or equal to $P_i + U_i + P_i \log(U_i)$, where $1 \le P_i, U_i \le N$.


Proof: We proceed by induction on the size of $U_i$. For the base case: we have at most
one unaccounted leaf and some number of processors ($U_i = 1$, $P_i \ge 1$). As long as at
least one processor survives, we are going to visit the single remaining leaf in one phase
in which at most $P_i$ processors participate, and $P_i \le P_i + 1 + P_i \log(1)$.

For the inductive hypothesis: we assume the lemma is true for all $U_i < U$, $P_i \ge 1$,
where $U \le N$. We will then prove it for $U_i = U$, $P_i \ge 1$.


We divide the proof in two cases: (1) at least as many unaccounted leaves as
processors, i.e., $P_i \le U_i$, and (2) more processors than unaccounted leaves, i.e., $P_i > U_i$.

In both cases, by Lemma 3.2, we have that the (accounted) progress for iteration i
is at least the number of surviving processors $P_{i+1}$ divided by $\lceil R_i/U_i \rceil$. This is because
each one of these processors returns to the root d[1], reporting some progress, and at
most $\lceil R_i/U_i \rceil$ processors report information about the same leaf.


Also, by Lemma 3.1, $P_i \ge R_i \ge P_{i+1}$, and we can assume that $kP_i = P_{i+1}$, for some
k with $0 < k \le 1$ (at least one processor survives). Thus, for both the above cases, we
have:

\[
U_{i+1} \;\le\; U_i - \frac{P_{i+1}}{\lceil R_i/U_i \rceil}
        \;\le\; U_i - \frac{kP_i}{1 + P_i/U_i}
        \;=\; U_i\left(1 - \frac{k}{1 + U_i/P_i}\right)
\]

For the case (1) it is easy to see that we will have at most one processor allocated to
each unaccounted leaf so: $U_{i+1} \le U_i - P_{i+1}$. For the case (2) by the above inequality
and $P_i > U_i$ we have $U_{i+1} < U_i(1 - k/2)$. Now we use the inductive hypothesis (but for
iteration i + 1) in both cases.
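
The step from the general bound to the case (2) bound can be spelled out as follows. This is a worked restatement under the assumption that the general bound reads $U_{i+1} \le U_i - P_{i+1}/\lceil R_i/U_i \rceil$ with $P_{i+1} = kP_i$; it uses only Lemma 3.1 and the case condition $P_i > U_i$.

```latex
\begin{align*}
\lceil R_i/U_i \rceil &\le 1 + R_i/U_i \le 1 + P_i/U_i
  && \text{(since $R_i \le P_i$ by Lemma 3.1)} \\
\frac{kP_i}{1 + P_i/U_i} &= \frac{kP_iU_i}{U_i + P_i}
  = U_i \cdot \frac{k}{1 + U_i/P_i} \\
P_i > U_i \;&\Longrightarrow\; 1 + U_i/P_i < 2
  \;\Longrightarrow\; \frac{k}{1 + U_i/P_i} > \frac{k}{2}
  \;\Longrightarrow\; U_{i+1} < U_i\left(1 - \frac{k}{2}\right)
\end{align*}
```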

Case (1): The survival of at least one processor and $U_{i+1} \le U_i - P_{i+1}$ imply that
$U_{i+1} < U_i$. The total work (in block-steps) is at most $P_i + V_{i+1}$, where by the hypothesis
$V_{i+1} \le P_{i+1} + U_{i+1} + P_{i+1} \log(U_{i+1})$. Thus, it suffices to show that
$P_{i+1} + U_{i+1} + P_{i+1} \log(U_{i+1})$ is less than or equal to $U_i + P_i \log(U_i)$. This is
trivial given $U_{i+1} \le U_i - P_{i+1}$ and Lemmas 3.1 and 3.3.

Case (2): There are two subcases. If k = 1 the algorithm completes correctly in one
iteration and the work $P_i = R_i = P_{i+1}$ trivially satisfies the Lemma. The second
subcase is the most interesting one and is if $0 < k < 1$. For this subcase we use
$U_{i+1} < U_i(1 - k/2)$, which implies $U_{i+1} < U_i$. As in case (1), the total work (in
block-steps) is at most $P_i + V_{i+1}$, where by the hypothesis
$V_{i+1} \le P_{i+1} + U_{i+1} + P_{i+1} \log(U_{i+1})$. Thus, it suffices to show that
$P_{i+1} + U_{i+1} + P_{i+1} \log(U_{i+1})$ is less than or equal to
$U_i + P_i \log(U_i)$. For $2 > U_{i+1} = 1$ this is trivial.

By simple manipulation it suffices to show that $kP_i + U_i(1 - k/2) + kP_i \log(U_i(1 - k/2))$
is less than or equal to $U_i + P_i \log(U_i)$. This is equivalent to showing that

\[ k\left(1 - \frac{U_i}{2P_i}\right) + k \log\left(1 - \frac{k}{2}\right) \le (1 - k)\log U_i \]

Recall that all logarithms are base 2 and therefore $\log(1/2) = -1$. Since $U_i \ge 2$ ($U_i = 1$
was taken care of by the base case) we have $\log U_i \ge 1$. Also, in this case $1 - U_i/2P_i < 1$.
It thus suffices to show the inequality

\[ k \log(2 - k) \le (1 - k), \quad \text{for } 0 < k < 1 \qquad (*) \]

Inequality (*) is true by elementary calculus (it is tight only for k = 1). This completes
the proof of the second subcase of case (2), and of the Lemma. □

Remark 3.2 This lemma also shows that if an algorithm existed that could balance
the load of the surviving processors and allocate them in constant time, then the
Write-All problem could be solved with logarithmic overhead ($\log U_1 = \log N$). The lower
bound result in Section 4 provides some intuition for the multiplicative factor $\log U_i$.

Theorem 3.5 Algorithm W is a robust parallel algorithm for the Write-All problem
with $S^* = O(N \log N + P \log^2 N)$, where N is the input array size, and the initial
number of processors P is between 1 and N.

Proof: This immediately follows from Definition 2.6 and Lemmas 3.1-3.4. Note that
although we assumed N processors in Algorithm W, we only used the fact that $P \le N$
in the lemmas. In fact, as indicated earlier, we accommodate $P < N$ processors by
considering that N - P processors failed prior to the beginning of the algorithm. This
contributes a single charge of O(N - P) to the cost, and does not distort the asymptotic
result that consists of the product of the total block-steps $V_1$ performed by the algorithm
since the first iteration of the algorithm (inclusive) times the per-block-step cost of
O(log N):

\[
\begin{aligned}
S &= V_1 \cdot O(\log N) = (P_1 + U_1 + P_1 \log U_1) \cdot O(\log N) && \text{(using Lemma 3.4)} \\
  &= (P + N + P \log N) \cdot O(\log N) && \text{(using $P_1 = P$ and $U_1 = N$)} \\
  &= O(P \log N + N \log N + P \log^2 N) = O(N \log N + P \log^2 N). \qquad \Box
\end{aligned}
\]

One immediate observation from this result is that fewer processor steps will be
expended by the algorithm if it is started with fewer than N processors. For example we
reach a bound of $S^* = O(N \log N)$ when using $P = N/\log N$ processors. A question can
be posed: could an optimal algorithm for the Write-All problem be constructed using
a non-trivial number of processors? This question is positively answered below.
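
The effect of starting with fewer processors can be checked numerically. The sketch below simply evaluates the expression inside the O() of Theorem 3.5 for two processor counts; the function name `work_bound` is introduced for the illustration, and the constants hidden by the O-notation are ignored.

```python
from math import log2


def work_bound(n, p):
    """Evaluate N log N + P log^2 N, the expression inside the O() of Theorem 3.5."""
    return n * log2(n) + p * log2(n) ** 2


n = 2 ** 16
# Ratio of the bound to N log N for P = N and for P = N / log N.
print(work_bound(n, n) / (n * log2(n)))             # 17.0
print(work_bound(n, n / log2(n)) / (n * log2(n)))   # 2.0
```

With P = N the second term dominates by a factor of log N, while with P = N / log N both terms are equal to N log N.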

We first observe that each block-step takes O(log N) time and therefore each
processor can be asked to perform O(log N) processing steps in phase W3 without affecting
the asymptotic complexity. To take advantage of this, we parameterize algorithm W as
follows:

1. Let N be the size of the input.

2. Let $H \le N$ be the instance size for the algorithm, thus the height of the trees
used is log H.

3. Let G = N/H be the number of the input array elements mapped to each leaf of
the heaps.

4. Let $P \le H$ be the initial number of processors.

With these data structures, the performance of algorithm W is described by the
following lemma:

Lemma 3.6 Algorithm W with P processors, the progress tree with H leaves ($P \le H$)
and 2H - 1 total nodes all initialized to zero and G array elements at each leaf, has the
work of $S^* = O((H + P \log H) \cdot (\log H + G))$ for any pattern of stop failures.

Proof: The cost of a single block-step $C_B$ is $O(\log H + G) = O(\log H + N/H)$. By
Lemma 3.4 the algorithm will verifiably visit all leaves of the progress heap after
spending $V_1 = P_1 + U_1 + P_1 \log U_1 = P + H + P \log H = O(H + P \log H)$ block-steps.
Therefore $S^* = V_1 \cdot C_B$, and so:

\[ S^* = O(H + P \log H) \cdot O(\log H + N/H) = O\left(H \log H + P \log^2 H + N + \frac{NP}{H} \log H\right). \qquad \Box \]

To achieve work optimality, we would like to choose the parameters in the lemma
so that $S^* = O(N)$. While the exact solution is involved, we observe that the following
values for parameters G, H and P produce the desired result:

\[ G = \log N, \quad H = N/\log N, \quad \text{and} \quad P = H/\log H = N/(\log^2 N - \log N \log\log N). \]
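
A quick numerical check illustrates why this choice works. The sketch assumes the parameter values are G = log N, H = N / log N and P = H / log H, treats them as real numbers without rounding, and evaluates the expression inside the O() of Lemma 3.6; the function names are hypothetical.

```python
from math import log2


def optimal_parameters(n):
    """Return (G, H, P) for input size n, as real numbers (no rounding)."""
    g = log2(n)
    h = n / log2(n)
    p = h / log2(h)
    return g, h, p


def lemma_3_6_expression(g, h, p):
    """(H + P log H)(log H + G), the expression inside the O() of Lemma 3.6."""
    return (h + p * log2(h)) * (log2(h) + g)


for k in (10, 20, 30):
    n = 2 ** k
    g, h, p = optimal_parameters(n)
    print(k, lemma_3_6_expression(g, h, p) / n)   # the ratio stays below 4
```

Since P log H = H and log H < log N, the expression is less than 2H times 2 log N, which equals 4N for these parameter values.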

Thus  by  exploiting  parallel  slackness,  we  achieve  work  optimality  using  a  number  of 
processors  smaller  than  N: 

Theorem 3.7 Parameterized algorithm W with log N array elements mapped to each
leaf of the progress heap is a robust parallel algorithm that solves the Write-All problem
of size N with $S^* = O(N)$, when $P \le N/(\log^2 N - \log N \log\log N)$.

However, as we show in the chapter on lower bounds, no optimal N-processor
algorithm exists for Write-All.

The parameterized algorithm W as in the last theorem above can also be used with
any number of processors P such that $1 \le P \le N$. When using P processors such that
$P > N/\log N$, it is sufficient for each processor to take its PID modulo $N/\log N$ to
assure a uniform initial assignment of at least $\lfloor P/(N/\log N) \rfloor$ and no more
than $\lceil P/(N/\log N) \rceil$ processors to a work element.
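
The floor and ceiling bounds of such a modulo assignment are easy to confirm on a small instance. In this sketch the function name `initial_assignment` and the parameter `leaves` (standing for the number of work elements) are assumptions, and leaves are numbered from 0 for convenience.

```python
from collections import Counter


def initial_assignment(p, leaves):
    """Map PIDs 1..p onto leaves 0..leaves-1 by taking each PID modulo `leaves`."""
    return Counter(pid % leaves for pid in range(1, p + 1))


load = initial_assignment(p=22, leaves=8)
print(min(load.values()), max(load.values()))  # 2 3
```

Here 22 processors over 8 leaves give every leaf either floor(22/8) = 2 or ceil(22/8) = 3 processors.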

Worst  case  adversary  for  algorithm  W 

We are going to show in Chapter 4 that an adversary strategy can be constructed
so that any Write-All algorithm with P = N processors will be forced to perform
$S = \Omega(N \log N / \log\log N)$ (Theorem 4.4).
That theorem can be directly utilized to produce the following result:

Theorem 3.8 There is a processor failure pattern for algorithm W that results in
$S = \Omega(N \log^2 N / \log\log N)$, for P = N.

Proof: This is accomplished by using the adversary that fail-stops processors according
to the strategy as in the proof of Theorem 4.4, except that instead of PRAM steps,
the adversary uses block-steps, and the processors are stopped only during the phase
W3 where the operations on the actual data take place. This corresponds to the fixed
per block-step processor survival coefficient k defined in Lemma 3.4 being equal to
$1 - 1/\log N$. □

Using a slightly different strategy, it is possible to construct failure patterns that
force the algorithm to take $\Omega(N \log N \log\log N)$ steps (this example is due to Jeff
Vitter). This is done by utilizing a variable per block processor survival coefficient
$k = 1 - 1/\sqrt{U_b}$, where $U_b$ is the estimate of the number of unvisited leaves in
Algorithm W after the completion of block iteration b.

It was shown by Martel [71] that the worst case performance of algorithm W is no
worse than $S = O(N \log^2 N / \log\log N)$. We state this result here and give its proof in
Appendix A.6.

Theorem 3.9 [71] Algorithm W is a robust parallel algorithm for the Write-All problem
with $S^* = O(N \log^2 N / \log\log N)$, where N is the input array size, and the initial
number of processors P is between 1 and N.

3.2.2  Algorithm  V 

Algorithm  W  in  the  previous  section  is  an  efficient  fail-stop  (no  restart)  Write-All 
solution.  It  has  efficient  completed  work  when  subjected  to  arbitrary  failure  patterns 
without  restarts.  It  can  be  extended  to  handle  processor  restarts  by  introducing  an 
iteration  counter,  and  having  the  revived  processors  wait  for  the  start  of  a  new  iteration. 
However,  this  algorithm  may  not  terminate  if  the  adversary  does  not  allow  any  of  the 
processors  that  were  alive  at  the  beginning  of  an  iteration  to  complete  that  iteration. 
Even if the extended algorithm were to terminate, its completed work is not bounded
by a function of N and P.

In addition, the proof framework for algorithm W does not easily extend to include
processor restarts: the processor enumeration and allocation phases become inefficient
and possibly incorrect, since no accurate estimates of active processors can be obtained
when the adversary can revive any of the failed processors at any time.

On the other hand, the second phase of algorithm W can implement processor assignment
(in a manner similar to that used in the proof of Theorem 4.7) in $O(\log N)$
time by using the permanent processor PID in the top-down divide-and-conquer allocation.
This also suggests that the processor enumeration phase of algorithm W does
not improve its efficiency when processors can be restarted.

Therefore we present a modified version of algorithm W, which we call V. To avoid a
complete restatement of the details of algorithm W, the reader is urged to refer to the
previous section (3.2.1).


Definition  of  algorithm  V 

We formulate algorithm V using the data structures of the optimized algorithm W.

Input: Shared array $x[1..N]$; $x[i] = 0$ for $1 \le i \le N$.

Output: Shared array $x[1..N]$; $x[i] = 1$ for $1 \le i \le N$.

Data-structures: The algorithm uses full binary trees with $N/\log N$ leaves for progress
estimation and processor allocation. There are $\log N$ array elements associated with
each leaf of the progress tree. Each processor, instead of using its PID during the
computation, uses the PID modulo $N/\log N$. When the number of processors P is such that
$P > N/\log N$, this assures that there is a uniform initial assignment of at least
$\lfloor P/(N/\log N) \rfloor$ and no more than $\lceil P/(N/\log N) \rceil$ processors to the
work elements at each leaf.
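The uniform initial assignment can be checked with a small sketch. It assumes hypothetical sizes (N = 16, so the tree has N / log N = 4 leaves) and labels the leaves 0..3 purely for illustration; the function name `initial_leaf_counts` is not from the thesis.

```python
from collections import Counter


def initial_leaf_counts(num_processors: int, num_leaves: int) -> Counter:
    """Count processors per leaf when each processor acts as PID mod num_leaves.

    PIDs are taken to be 1..num_processors, as in Figure 3.2. The leaf labels
    0..num_leaves-1 are an illustrative convention, not the thesis's numbering.
    """
    return Counter(pid % num_leaves for pid in range(1, num_processors + 1))


# Hypothetical sizes: N = 16, so log N = 4 and the tree has N / log N = 4 leaves.
counts = initial_leaf_counts(num_processors=10, num_leaves=4)
fewest, most = min(counts.values()), max(counts.values())
# Every leaf receives between floor(10/4) = 2 and ceil(10/4) = 3 processors.
```

Processors that share a leaf here have equal PIDs modulo the number of leaves, which is the fact the analysis of algorithm V relies on.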

Control-flow: Algorithm V is an iterative algorithm using the following three phases.

Phase V1 - processor allocation: Allocate processors using PIDs in a dynamic top-down
traversal of the progress tree to assure load balancing ($O(\log N)$ time).

Phase V2 - work: The processors now perform work at the leaves they reached in
phase V1 (there are $\log N$ array elements per leaf).

    forall processors PID=1..N parbegin
      Phase V2: Visit the leaves based on PID to perform work on the input data.
      Phase V3: Traverse the d heap bottom up to measure progress.
      while the root of the d heap is not N do
        Phase V1: Traverse the d,a,c heaps top down to reschedule work.
        Phase V2: Perform rescheduled work on the input data.
        Phase V3: Traverse the d heap bottom up to measure progress.
      od
    parend

Figure 3.2: A high level view of algorithm V.

Phase  V3  -  progress  measurement:  The  processors  begin  at  the  leaves  of  the  progress 
tree  where  they  ended  phase  V2  and  update  the  progress  tree  dynamically,  bottom 
up  (O(log  N)  time). 


Processor re-synchronization after a failure and a restart is an important implementation
detail. One way of realizing processor re-synchronization is through the utilization
of an iteration wrap-around counter that is based on the synchronous PRAM clock. If
a processor fails, and then is restarted, it waits for the counter wrap-around to rejoin
the computation. The point at which the counter wraps around depends on the length
of the program code, but it is fixed at "compile time".
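A minimal sketch of the wrap-around rule follows. It assumes the clock starts at 0 and that every iteration takes the same fixed number of ticks; the name `rejoin_time` and the parameter `iteration_length` are hypothetical.

```python
def rejoin_time(restart_time: int, iteration_length: int) -> int:
    """Return the first clock tick at or after restart_time where the counter wraps.

    The counter is modelled as clock % iteration_length, so it wraps around at
    multiples of iteration_length (an assumption made for this illustration).
    """
    # Ceiling division, written with integer arithmetic only.
    completed_iterations = -(-restart_time // iteration_length)
    return completed_iterations * iteration_length
```

For example, with an iteration of 12 ticks, a processor restarted at tick 17 idles until tick 24, while one restarted exactly at tick 24 rejoins immediately.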


Analysis  of  algorithm  V 

We now analyze the performance of this algorithm first in the fail-stop, and then in the
fail-stop and restart setting.

Lemma 3.10 The work of algorithm V using $P \le N$ processors that are subject to
fail-stop errors without restarts is $S' = O(N + P \log^2 N)$.

Proof: We factor out any work that is wasted due to failures by charging this work to
the failures. Since the failures are fail-stop, there can be at most P failures, and each
processor that fails can waste at most $O(\log N)$ steps corresponding to a single iteration
of the algorithm. Therefore the work charged to the failures is $O(P \log N)$, and it will
be absorbed by the rest of the work.

We next evaluate the work that directly contributes to the progress of the algorithm
by distinguishing two cases below. In each of the cases, it takes $O(\log(N/\log N)) = O(\log N)$
time to perform processor allocation, and $O(\log N)$ time to perform the work at the
leaves. Thus each iteration of the algorithm takes $O(\log N)$ time. We use the allocation
technique of Theorem 4.7, where instead of reading and locally processing the entire
memory at unit cost, we use an $O(\log N)$ time iteration for processor allocation.


Case 1: $1 \le P \le N/\log N$. In this case, at most 1 processor is initially allocated to each
leaf. As in the proof of Theorem 4.7, when the first $N/\log N - P$ leaves are visited, there
is no more than one processor allocated to each leaf by the balanced allocation phase.
When the remaining P or fewer leaves are visited, the work is $O(P \log P)$ by Theorem 4.7
(not counting processor allocation). Each leaf visit takes $O(\log N)$ work steps; therefore
the completed work is:

\[
S' = O\left(\left(\frac{N}{\log N} - P + P \log P\right) \cdot \log N\right) = O(N + P \log P \log N) = O(N + P \log^2 N).
\]

Case 2: $N/\log N < P \le N$. In this case, no more than $\lceil P/(N/\log N) \rceil$ processors are initially
allocated to each leaf. Any two processors that are initially allocated to the same leaf,
should they both survive, will behave identically throughout the computation. Therefore
we can use Theorem 4.7 with the $\lceil P/(N/\log N) \rceil$ processor allocation as a multiplicative
factor. From this, the work is:

\[
S' = \left\lceil \frac{P}{N/\log N} \right\rceil \cdot O\left(\frac{N}{\log N} \log \frac{N}{\log N}\right) \cdot O(\log N) = O(P \log^2 N).
\]

The results of the two cases combine to yield $S' = O(N + P \log^2 N)$. □


The following corollary extracts the slightly better bound analyzed in case (1)
above, and it also covers the processor range for which the work of the algorithm is
optimal.

Corollary 3.11 The work of algorithm V using $P \le N/\log N$ processors that are
subject to fail-stop errors without restarts is $S' = O(N + P \log N \log P)$.

The upper bound analysis is tight:

Theorem 3.12 There is a fail-stop adversary that causes the work of algorithm V to
be $S' = \Omega(P \log^2 N)$ for the number of processors $N/\log N \le P \le N$, and
$S' = \Omega(N + P \log N \log P)$ for the number of processors $1 \le P \le N/\log N$.

Proof: Consider the following adversary for $P = N/\log N$. At the outset the adversary
fail-stops all processors that are initially assigned to the, say, left subtree of the progress
tree. Let the number of unvisited array elements be U. By the balanced allocation
lemma (3.2) the N processors (dead or alive) will be assigned in a balanced fashion to the
left and right segments of the contiguous U unvisited elements. Initially, $U_0$ is $N/\log N$,
and so the algorithm will terminate in $\log U_0 = \Theta(\log N)$ block-steps when such an
adversary is encountered. Each block-step takes $\Theta(\log N)$ time using the remaining P/2
processors. Thus the work is $S' \ge \frac{P}{2} \Theta(\log N) \Theta(\log N) = \Theta(P \log^2 N) = \Omega(N \log N)$.

When P is larger than $N/\log N$, then each leaf is allocated at least $\lfloor P/(N/\log N) \rfloor$ and
no more than $\lceil P/(N/\log N) \rceil$ processors. All processors allocated to the same leaf have their
PIDs equal modulo $N/\log N$. Therefore the work is increased by at least a factor of
$\lfloor P/(N/\log N) \rfloor$ as compared to the case $P = N/\log N$. I.e.,
$S' = \lfloor P/(N/\log N) \rfloor \, \Omega(N \log N) = \Omega(P \log^2 N)$.

Finally, when $P < N/\log N$, the result follows similarly using the strategy of
case (1) of Lemma 3.10. □

The following theorem expresses the completed work of the algorithm in the presence
of restarts:

Theorem 3.13 The completed work of algorithm V using $P \le N$ processors subject to
an arbitrary failure and restart pattern F of size M is $S = O(N + P \log^2 N + M \log N)$.

Proof: The proof of Lemma 3.10 does not rely on the fact that in the absence of
restarts, the number of active processors is non-increasing. However, the lemma does
not account for the work that might be performed by processors that are active during
a part of an iteration but do not contribute to the progress of the algorithm due to
failures. To account for all work, we are going to charge to the array being processed
the work that contributes to progress, and any work that was wasted due to failures will
be charged to the failures and restarts. Lemma 3.10 accounts for the work charged to
the array. Otherwise, we observe that a processor can waste no more than $O(\log N)$ time
steps without contributing to the progress due to a failure and/or a restart. Therefore
this amount of wasted work is bounded by $O(M \log N)$. This proves the theorem. (Note
that the completed work S of V is small for small $|F|$, but not bounded by a function
of P and N for large $|F|$.) □

Remark 3.3 Recall that when the failure patterns are such that the size M is bounded
by P, then the measures $S'$ and $S$ are asymptotically equal.


3.2.3  Processor  Allocation  Monotonicity 

One of the advantages of the two algorithms that we presented for the global allocation
paradigm is that both algorithms have what we define as the processor allocation
monotonicity property:


Definition 3.1 A given Write-All algorithm has the processor allocation monotonicity
property if, whenever during the execution of any step of the algorithm two array
locations $a_1$ and $a_2$ (without loss of generality let $a_1 < a_2$) are being concurrently
written to by two processors with PIDs $p_1$ and $p_2$ respectively, then $p_1 < p_2$.
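As an illustration of Definition 3.1, the following sketch tests one step's concurrent writes for monotonicity. The representation of a step as a list of (PID, array location) pairs and the name `is_monotone_step` are assumptions made for this example.

```python
from itertools import combinations


def is_monotone_step(writes: list[tuple[int, int]]) -> bool:
    """Check Definition 3.1 for the (pid, location) writes of a single step.

    For every pair of writes to distinct locations, the processor writing
    the smaller location must have the smaller PID.
    """
    for (p1, a1), (p2, a2) in combinations(writes, 2):
        if a1 < a2 and not p1 < p2:
            return False
        if a2 < a1 and not p2 < p1:
            return False
    return True
```

All pairs are compared rather than only neighbours in location order, because several processors may write the same location in one step and a neighbour-only check could miss a violation among them.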


This property becomes important when parallel algorithms are simulated on fault-prone
PRIORITY PRAMs. These simulations will be addressed in Chapter 5. Neither
the local allocation algorithm nor the hashed allocation algorithm presented further in
this chapter have this property.

To show that algorithms W and V satisfy the processor allocation monotonicity
property, we need to examine the processor enumeration and allocation phases of algorithm
W, and the allocation phase of algorithm V. In the enumeration phase, a
surviving processor is enumerated by adding one to the overestimate of surviving processors
with PIDs smaller than that processor's PID. Thus processors are enumerated
monotonically: larger PIDs are given larger processor numbers. In the allocation phase,
the enumerated processors are assigned in a divide-and-conquer strategy according to
a binary tree: lower (higher) numbered processors are assigned to the subtrees that
contain lower (higher) numbered work elements. Therefore the processor allocation is
monotonic. The same holds for algorithm V, since processors always have their permanent
PIDs and enumeration is not used. This proves the following:


Property 3.2 Algorithms V and W satisfy the processor allocation monotonicity property.

3.3 Local Allocation Paradigm

In  this  section  we  present  and  analyze  an  algorithm  that  can  be  used  in  both  the  non- 
restartable  and  restartable  models.  We  call  it  algorithm  X.  We  also  propose  (without 
analysis)  two  randomized  versions  of  this  algorithm  that  have  the  potential  of  being 
more  efficient  (in  the  expected  work  analysis)  than  algorithm  X,  even  when  subjected 
to  the  omniscient  adaptive  adversaries. 

3.3.1  Algorithm  X 

We present an algorithm for the Write-All problem, and show that its completed work
complexity is $S = O(N \cdot P^{\log \frac{3}{2}}) = O(N \cdot P^{0.59})$ using $P \le N$ processors in the restartable
fail-stop model of computation. The important property of X is that it has bounded
sub-quadratic completed work; in the restartable fail-stop model, this is independent of
the failure pattern. If a very large number of failures occurs, say $|F| = \Omega(N \cdot P^{\log \frac{3}{2}})$,
then the algorithm's overhead ratio $\sigma$ becomes optimal; it takes a fixed number of
computing steps per failure/recovery.

Definition  of  algorithm  X 

Like  algorithm  V,  algorithm  X  utilizes  a  progress  tree  of  size  N ,  but  it  is  traversed  by 
the  processors  independently,  not  in  synchronized  phases.  This  reflects  the  local  nature 
of  the  processor  allocation  in  algorithm  X  as  opposed  to  the  global  allocation  used  in 
algorithms  V  and  W.  Each  processor,  acting  independently,  searches  for  work  in  the 
smallest  immediate  subtree  that  has  work  that  needs  to  be  done.  It  then  performs  the 
necessary  work,  and  moves  out  of  that  subtree  when  no  more  work  remains.  We  present 
the  algorithm  on  the  restartable  fail-stop  model. 

Input: Shared array $x[1..N]$; $x[i] = 0$ for $1 \le i \le N$.

Output: Shared array $x[1..N]$; $x[i] = 1$ for $1 \le i \le N$.

Data-structures: The algorithm uses a full binary tree of size $2N - 1$, stored as a heap
$d[1..2N-1]$ in shared memory. An internal tree node $d[i]$ ($i = 1, \ldots, N-1$) has the
left child $d[2i]$ and the right child $d[2i+1]$. The tree is used for progress evaluation and
processor allocation. The values stored in the heap are initially 0.

    01 forall processors PID=0..P-1 parbegin
    02   Perform initial processor assignment to the leaves of the progress tree
    03   while there is still work left in the tree do
    04     if current subtree is done then move one level up
    05     elseif this is a leaf then perform the work at the leaf
    06     elseif this is an interior tree node then
    07       if both subtrees are done then update the tree node
    08       elseif only one is done then go to the one that is not done
    09       else move to the left/right subtree according to PID bit values
    10       fi
    11     fi
    12   od
    13 parend

Figure 3.3: A high level view of algorithm X.

The N elements of the input array $x[1..N]$ are associated with the leaves of the
tree. Element $x[i]$ is associated with $d[i + N - 1]$, where $1 \le i \le N$. The algorithm also
utilizes an array $w[0..P-1]$ that is used to store individual processor locations within
the progress tree d.

Each processor uses some constant amount of private memory to perform simple
arithmetic computations. An important private constant is PID, containing the processor's
own identifier.

Thus, the overall memory used is $O(N + P)$ and the data-structures are simple.

Control-flow:  The  algorithm  consists  of  a  single  initialization  and  of  the  parallel  loop. 
A  high  level  view  of  the  algorithm  is  in  Figure  3.3;  all  line  numbers  refer  to  this  figure. 
More  detailed  code  can  be  found  in  Appendix  B. 

The initialization (line 02) assigns the P processors to the leaves of the progress
tree so that the processors are assigned to the first P leaves by storing the initial leaf
assignment in w[PID]. The loop (lines 03-12) consists of a multi-way decision (lines 04-11).
If the current node is marked done, the processor moves up the tree (line 04). If
the processor is at a leaf, it performs work (line 05). If the current node is an unmarked
interior node and both of its subtrees are done, the interior node is marked by changing
its value from 0 to 1 (line 07). If a single subtree is not done, the processor moves down
appropriately (line 08).

For the final case (line 09), the processors move down when neither child is done.
This last case is where a non-trivial (italicized) decision is made. The PID of the
processor is used at depth h of the tree node based on the value of the h-th most significant
bit of the binary representation of the PID: bit 0 will send the processor to the left, and
bit 1 to the right.

Regardless of the decision made by a processor within the loop body, each iteration of
the body consists of no more than four shared memory reads, a fixed time computation
using private memory, and one shared memory write (see Appendix B for the detailed
algorithm). Therefore the body can be implemented as an update cycle.
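The loop of Figure 3.3 can be sketched as a failure-free simulation. This is an illustration under stated assumptions, not the detailed code of Appendix B: processors take their update cycles in round-robin order, the leaf step writes x and marks the leaf in the same cycle, the root is taken to be at depth 0 when selecting the PID bit, and N is a power of two. The name `simulate_algorithm_x` is hypothetical.

```python
def simulate_algorithm_x(n: int, p: int) -> tuple[list[int], int]:
    """Run the Figure 3.3 loop without failures; return (x[1..n], completed cycles)."""
    assert n >= 1 and n & (n - 1) == 0 and 1 <= p <= n
    height = n.bit_length() - 1          # log2(n), the depth of the leaves
    x = [0] * (n + 1)                    # input array x[1..n]; x[0] is unused
    d = [0] * (2 * n)                    # progress heap d[1..2n-1]; d[0] is unused
    w = [n + pid for pid in range(p)]    # line 02: PID starts at the (PID+1)-th leaf
    cycles = 0
    while d[1] == 0:                     # line 03: work remains while the root is 0
        for pid in range(p):
            node = w[pid]
            if d[node] == 1:             # line 04: subtree done, move one level up
                if node > 1:
                    w[pid] = node // 2
            elif node >= n:              # line 05: leaf, perform the work
                x[node - n + 1] = 1
                d[node] = 1              # assumed: marking shares the same cycle
            else:
                left, right = 2 * node, 2 * node + 1
                if d[left] and d[right]:         # line 07: mark the interior node
                    d[node] = 1
                elif d[left] != d[right]:        # line 08: go to the unfinished child
                    w[pid] = right if d[left] else left
                else:                            # line 09: choose by a PID bit
                    depth = node.bit_length() - 1
                    bit = (pid >> (height - 1 - depth)) & 1
                    w[pid] = right if bit else left
            cycles += 1
            if d[1]:                     # stop as soon as the root is marked
                break
    return x[1:], cycles
```

Calling `simulate_algorithm_x(8, 8)` fills all eight array elements. Adding an adversary would amount to skipping some processors' cycles in the inner loop, which this sketch deliberately leaves out.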

Example 3.3 Progress tree traversal: Consider algorithm X for N = P = 8. The
progress tree d of size 2N - 1 = 15 is used to represent the full binary progress tree with
eight leaves. The 8 processors have PIDs in the range 0 through 7. Their initial positions
are indicated in Figure 3.4 under the leaves of the tree. The diagram illustrates the
state of a computation where the processors were subject to some failures and restarts.
Heavy dots indicate nodes whose subtrees are finished. The paths being traversed by
the processors are indicated by the arrows. Active processor locations (at the time when
the snapshot was taken) are indicated by their PIDs in brackets. In this configuration,
should the active processors complete the next cycle, they will move in the directions
indicated by the arrows: processors 0 and 1 will descend to the left and right respectively,
processor 4 will move to the unvisited leaf to its right, and processors 6 and 7 will move
up. □

Analysis of algorithm X

We begin by showing the correctness and termination of algorithm X in the following
simple lemma.

Lemma 3.14 Algorithm X is a correct, terminating and fault-tolerant
solution for the P-processor Write-All problem of size N. The algorithm
terminates in at least $\Omega(\log N)$ and at most $O(P \cdot N)$ time steps.

Proof: We first observe that the processor loads are localized in the sense that a
processor exhausts all work in the vicinity of its original position in the tree, before
moving to other areas of the tree. If a processor moves up out of a subtree then all the
leaves in that subtree were visited. We also observe that it takes exactly one update
cycle to: (i) change the value of a progress tree node from 0 to 1, (ii) move up from
a (non root) node, (iii) move down left, or (iv) move down right from a (non leaf) node.
Therefore, given any node of the progress tree and any processor, the processor will visit
and spend exactly one complete update cycle at the node no more than four times.

Since there are $2N - 1$ nodes in the progress tree, any processor will be able to
execute no more than $O(N)$ completed update cycles. If there are P processors, then all
processors will be able to complete no more than $O(P \cdot N)$ update cycles. Furthermore,
at any point in time, there is at least one update cycle that will complete. Therefore
it will take no more than $O(P \cdot N)$ sequential update cycles of constant size for the
algorithm to terminate.

Finally, we also observe that all paths from a leaf to the root are at least $\log N$ long,
therefore at least $\log N$ update cycles per processor will be required for the algorithm
to terminate. □

Now we proceed to the main work lemma. In the rest of this section, the expression
$S_{N,P}$ denotes the completed work on inputs of size N using P initial processors and
for any failure pattern. Note that in this lemma we assume $P \ge N$.

Lemma 3.15 The completed work of algorithm X for the Write-All problem of size
N with $P \ge N$ initial processors and for any pattern of failures and restarts is
$S_{N,P} = O(P \cdot N^{\log \frac{3}{2}})$.

Proof: We show by induction on the height of the progress tree that there are positive
constants $c_1, c_2, c_3$ such that $S_{N,P} \le c_1 P \cdot N^{\log \frac{3}{2}} - c_2 P \log N - c_3 P$.

For the base case: we have a tree of height 0 that corresponds to an input array of
size 1 and at least as many initial processors P. Since at least one processor, and at
most P processors will be active, this single leaf will be visited in a constant number
of steps. Let the work expended be $c'P$ for some constant $c'$ that depends only on the
lexical structure of the algorithm. Therefore
$S_{1,P} = c'P \le c_1 P \cdot 1^{\log \frac{3}{2}} - c_2 P \cdot 0 - c_3 P$
when $c_1$ is chosen to be larger than or equal to $c_3 + c'$.

Now consider a tree of height $\log N$ ($\ge 1$). The root has two subtrees (left and right)
of height $\log N - 1$. By the definition of algorithm X, no processor will leave a subtree
until the subtree is marked-one, i.e., the value of the root of the subtree is changed
from 0 to 1. We consider the following sub-cases: (1) both subtrees are marked-one
simultaneously, and (2) one of the subtrees is marked-one before the other.


Case 1: If both subtrees are marked-one simultaneously, then the algorithm will terminate
after the two independent subtrees terminate plus some small constant number of
steps $c'$ (when a processor moves to the root and determines that both of the subtrees
are finished). Both the work $S_L$ expended in the left subtree, and the work $S_R$ in
the right subtree are bounded by $S_{N/2,P/2}$. The added work needed for the algorithm
to terminate is at most $c'P$, and so the total work is:

\[
\begin{aligned}
S &\le S_L + S_R + c'P \le 2 S_{N/2,P/2} + c'P \\
  &\le 2 \left( c_1 \frac{P}{2} \left(\frac{N}{2}\right)^{\log \frac{3}{2}} - c_2 \frac{P}{2} \log \frac{N}{2} - c_3 \frac{P}{2} \right) + c'P \\
  &= \frac{2}{3} c_1 P \cdot N^{\log \frac{3}{2}} - c_2 P \log \frac{N}{2} - c_3 P + c'P \\
  &\le c_1 P \cdot N^{\log \frac{3}{2}} - c_2 P \log N - c_3 P
\end{aligned}
\]

for sufficiently large $c_1$ and any $c_2$ depending on $c'$, e.g., $c_1 \ge 3(c_2 + c')$.


Case 2: Assume without loss of generality that the left subtree is marked-one first with
$S_L = S_{N/2,P/2}$ work being expended in this subtree. Any active processors from the left
subtree will start moving via the root to the right subtree. The path traversed by any
processor as it moves to the right subtree after the left subtree is finished is bounded by
the maximum path length from a leaf to another leaf, $c' \log N$ for a predefined constant
$c'$. No more than the original P/2 processors of the left subtree will move, and so the
work of moving the processors is bounded by $c'(P/2) \log N$.

We observe that the cost of an execution in which P processors begin at the leaves
of a tree (with N/2 leaves) differs from the cost of an execution where P/2 processors
start at the leaves, and P/2 arrive at a later time via the root, by no more than the
cost $c'(P/2) \log N$ accounted for above. This can be simply shown by constructing a
scenario in which the second set of P/2 processors do not arrive through the root, but
instead start their execution with a failure, and then traverse along a path of 1's (if
any) in the progress tree, until they reach a 0 node that is either a leaf, or whose
descendants are marked. Having accounted for the difference, we see that the work $S_R$
to complete the right subtree using up to P processors is bounded by $S_{N/2,P}$ (by the
definition of S, if $P_1 \le P_2$, then $S_{N,P_1} \le S_{N,P_2}$). After this, each processor will spend
some constant number of steps moving to the root and terminating the algorithm. This
work is bounded by $c''P$ for some small constant $c''$. The total work S is:

\[
\begin{aligned}
S &\le S_L + c' \frac{P}{2} \log N + S_R + c''P \le S_{N/2,P/2} + c' \frac{P}{2} \log N + S_{N/2,P} + c''P \\
  &\le c_1 \frac{P}{2} \left(\frac{N}{2}\right)^{\log \frac{3}{2}} - c_2 \frac{P}{2} \log \frac{N}{2} - c_3 \frac{P}{2} + c' \frac{P}{2} \log N
      + c_1 P \left(\frac{N}{2}\right)^{\log \frac{3}{2}} - c_2 P \log \frac{N}{2} - c_3 P + c''P \\
  &= c_1 P \cdot N^{\log \frac{3}{2}} - c_2 P \log N \left( \frac{3}{2} - \frac{c'}{2 c_2} \right) - c_3 P \left( \frac{3}{2} - \frac{c''}{c_3} - \frac{3 c_2}{2 c_3} \right) \\
  &\le c_1 P \cdot N^{\log \frac{3}{2}} - c_2 P \log N - c_3 P
\end{aligned}
\]

for sufficiently large $c_2$ and $c_3$ depending on fixed $c'$ and $c''$, e.g., $c_2 \ge c'$ and
$c_3 \ge 3 c_2 + 2 c''$.

Since the constants $c', c''$ depend only on the lexical structure of the algorithm, the
constants $c_1, c_2, c_3$ can always be chosen sufficiently large to satisfy the base case and
both the cases (1) and (2) of the inductive step. This completes the proof of the lemma.
□


The quantity $P \cdot N^{\log \frac{3}{2}}$ is about $P \cdot N^{0.59}$. We next show a particular pattern of
failures for which the completed work of algorithm X matches this upper bound.

Lemma 3.16 There exists a pattern of fail-stop/restart errors that causes algorithm
X to perform $S = \Omega(N^{\log 3})$ work on the input of size N using P = N processors.

Proof: We can compute the exact work performed by the algorithm when the adversary
adheres to the following strategy:

(a) The processor with PID 0 will be allowed to sequentially traverse the progress tree
in post-order starting at the leftmost leaf and finishing at the rightmost leaf.

(b) The processors that find themselves at the same leaf as processor 0 are (re)started
and allowed to traverse the progress tree until they reach a leaf, where they are failed.

(c) Procedure (b) is repeated until all leaves are visited.

Thus the leaves of the progress tree are visited left to right, from the leaf number
1 to the leaf number N. At any time, if i is the number of the rightmost visited leaf,
then only the processors with PIDs 0 to i - 1 have performed at least one update cycle
thus far.

The cost of such a strategy can be expressed inductively as follows:

The cost $C_0$ of traversing a tree of size 1 using a single processor is 1 (unit of completed
work).

The cost $C_{i+1}$ of traversing a tree of size $2^{i+1}$ is computed as follows: first, there is the
cost $C_i$ of traversing the left subtree of size $2^i$. Then, all processors move to the right
subtree and participate (subject to failures) in the traversal of the right subtree at the
cost of $2C_i$; the cost is doubled, because the two processors whose PIDs are equal
modulo $2^i$ behave identically. Thus $C_{i+1} = 3C_i$, and $C_{\log N} = 3^{\log N} = N^{\log 3}$. □
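The recurrence can be unrolled numerically. The sketch below assumes the base cost of 1 for a single-leaf tree and uses the hypothetical name `adversary_cost`; it only evaluates the recurrence and does not simulate the adversary itself.

```python
from math import isclose, log2


def adversary_cost(height: int) -> int:
    """Cost of the Lemma 3.16 strategy on a progress tree with 2**height leaves."""
    cost = 1                      # a single-leaf tree costs one unit of work
    for _ in range(height):
        cost = cost + 2 * cost    # left subtree once, right subtree at doubled cost
    return cost


# For N = 1024 leaves the cost is 3**10, which agrees with N**log2(3).
n_leaves = 1024
matches_closed_form = isclose(adversary_cost(10), n_leaves ** log2(3), rel_tol=1e-9)
```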

Now we show how to use algorithm X with P processors to solve Write-All problems
of size N such that $P \le N$. Given an array of size N, we break the N elements of the
input into $N/P$ groups of P elements each (the last group may have fewer than P elements).
The P processors are then used to solve $N/P$ Write-All problems of size P one at a time.
We call this algorithm X', and we will use X' in the general simulations.
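The grouping used by X' can be sketched as follows. The function name `write_all_groups` and the representation of a group as an inclusive (first, last) index pair are assumptions for illustration; each group would then be handed to a run of algorithm X with all P processors.

```python
def write_all_groups(n: int, p: int) -> list[tuple[int, int]]:
    """Split indices 1..n into consecutive groups of p elements.

    Each group is an inclusive (first, last) pair; the last group may be
    shorter than p when p does not divide n.
    """
    return [(start, min(start + p - 1, n)) for start in range(1, n + 1, p)]
```

For instance, `write_all_groups(10, 4)` yields the groups (1, 4), (5, 8) and (9, 10), which are processed one after another.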


Remark 3.4 Strictly speaking, it is not necessary to modify algorithm X for $P \le N$
processors. Algorithm X can be used with $P \le N$ processors by initially assigning the
P processors to the first P elements of the array to be visited. It can also be shown
that X and X' have the same asymptotic complexity; however, the analysis of X' is
very simple, as we show below.

Theorem 3.17 Algorithm X' with P processors solves the Write-All problem of size
N ($P \le N$) using completed work $S = O(N \cdot P^{\log \frac{3}{2}})$. In addition, there is an adversary
that forces algorithm X' to perform $S = \Omega(N \cdot P^{\log \frac{3}{2}})$ work.

Proof: By Lemma 3.15, $S_{P,P} = O(P \cdot P^{\log \frac{3}{2}}) = O(P^{\log 3})$. Thus the overall work will
be $S = O(\frac{N}{P} S_{P,P}) = O(\frac{N}{P} P^{\log 3}) = O(N \cdot P^{\log \frac{3}{2}})$.

Using the strategy of Lemma 3.16, an adversary causes the algorithm to perform
work $S_{P,P} = \Omega(P^{\log 3})$ on each of the $N/P$ segments of the input array. This results in the
overall work of $S = \Omega(\frac{N}{P} P^{\log 3}) = \Omega(N \cdot P^{\log \frac{3}{2}})$. □


Remark 3.5 Lemma 3.14 gives only a loose upper bound for the worst performance of
algorithm X; there we are concerned with termination. The actual worst case time
for algorithm X can be no more than the upper bound on the completed work. This
is because at any point in time there is at least one update cycle that will complete.
Therefore, for algorithm X' with $P \le N$, the time is bounded by $O(N \cdot P^{\log \frac{3}{2}})$. In
particular, for P = N, the time is bounded by $O(N^{\log 3})$. In fact, using the worst case
strategy of Lemma 3.16, an adversary can "time share" the completed cycles of the
processors so only one processor is active at any given time, with the processor with
PID 0 being one step ahead of other processors. The resulting time is then $\Theta(N^{\log 3})$.

Remark 3.6 In algorithm X the processors work independently; they attempt to avoid
duplicating already-completed work but do not co-ordinate their actions with other
processors. In [27], Buss et al. show that this property allows the algorithm to run on
a strongly asynchronous PRAM with the same work and time bounds.

A  fail-stop  lower  bound  for  algorithm  X 

The  analysis  of  the  upper  bound  for  algorithm  A'  for  the  fail-stop  no-restart  model  is 
still  an  open  question.  In  this  section  we  show  and  analyse  a  particular  failure  scenario. 

The  failure  scenario  is  based  on  the  strategy  of  the  adversary  that  is  used  in  the  proof 
of  Theorem  4.4.  For  algorithm  A',  this  strategy  is  that  if  U  is  the  number  of  unvisited 
progress  tree  leaves,  then  in  each  step  where  the  unvisited  leaves  have  processors  assigned 
to  them,  the  adversary  stops  the  processors  that  are  assigned  to  the  rightmost  f'/logf 
leaves.  For  ,V  =  P  =  64,  this  strategy  is  illustrated  in  Figure  3.6. 
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Horizontal  boxes  represent  processor  “waves”  traversing  the  tree.  The  nodes  that  are 
marked  by  the  processors  as  “visited”  are  never  again  traversed.  These  nodes  and  the 
corresponding  tree  edges  are  erased  for  clarity  -  the  upward  moving  waves  appear  to  be 
“consuming”  the  tree.  When  two  waves  “collide”,  they  are  depicted  as  two  overlapping 
boxes  (steps  12,15,17).  After  a  collision  the  waves  are  merged. 

The adversary stops only the processors that are visiting the leaves. At step 1, the
processors assigned to the 16 (= N/log N) rightmost leaves are stopped. No processor
is stopped in steps 2-10. At step 11, the processors assigned to the 4 (= 16/log 16)
rightmost leaves are stopped. At step 13, the processors assigned to the 2 (= 4/log 4)
rightmost leaves are stopped. The computation terminates in five more steps after step
18, as the single remaining wave “gobbles up” the path to the root.

Figure 3.6: A fail-stop scenario for algorithm X with P = N = 64.

Theorem 3.18 There is a processor failure pattern for algorithm X that results in
$S = \Theta(N \log N \log\log N / \log\log\log N)$ for P = N.

Proof: The adversary uses a strategy similar to that in Theorem 4.4, except that
instead of acting at every PRAM step, the adversary stops processors only when the waves of
processors reach the unvisited leaves. We examine the following stages (stages 1 and 2
are illustrated in Figure 3.7):

Stage 1: The adversary stops the $N/\log N$ processors assigned to the rightmost $N/\log N$
leaves. The surviving $N - N/\log N$ processors are allowed to traverse the tree bottom-up
and then top-down (after reaching a node such that only one child is marked 1)
until the first wave of processors reaches the leaves. There will be
$\log\log N = \log N - \log(N/\log N)$ such waves. (This is illustrated in Figure 3.7, Stage 1.)

As each wave reaches the leaves, the processors that are assigned to the rightmost
logarithmic fraction of the leaves are stopped (in this proof, it is sufficient to use a $\log N$
fraction, while in the example in Figure 3.6 we use a logarithm of the remaining number
of unvisited leaves; asymptotically, both logarithms are equal). After the last wave
reaches the leaves, there are $\Theta(N / \log^{\log\log N} N)$ leaves left. Assume for simplicity
that this number is a power of two (it will be for some sufficiently large N, which is
sufficient for lower bounds). Note that the number of surviving processors is still $\Theta(N)$
($= N - N/\log N$).

Stage 2: The processors will have to traverse a path of length
$\Theta(\log(N / \log^{\log\log N} N)) = \Theta(\log N - \log\log^2 N)$ to find unvisited leaves
(here $\log\log^i N$ denotes $(\log\log N)^i$). This time there will be
$(\log N - \log\log N) - (\log N - \log\log^2 N) = \log\log^2 N - \log\log N = \Theta(\log\log^2 N)$
waves (subtracting subtree heights). This is illustrated in Figure 3.7, Stage 2. The number of
surviving processors is still $\Theta(N)$, as only a diminishing polylogarithmic fraction of
each wave is stopped.

In Stage i, there will be $\Theta(\log\log^i N)$ waves, and the total number of live processors
is still $\Theta(N)$. Each wave will have to traverse paths of length
$\Theta(\log N - \log\log^i N)$.

The algorithm terminates in stage r, where $r = \log\log N / \log\log\log N$, at which
time the remaining leaf (or a constant number of leaves) is visited, while
$\Theta(N)$ processors are still active. The work performed is as follows:

$N \sum_{i=1}^{r} (\log N - \log\log^i N) = N r \log N - N \sum_{i=1}^{r} \log\log^i N$

$= \Theta(N \log N \log\log N / \log\log\log N - N \log N / \log\log N)$

$= \Theta(N \log N \log\log N / \log\log\log N)$.

□


Recently Lopez-Ortiz refined the particular scenario used in the proof above to
exhibit the worst known fail-stop work for algorithm X of $\Theta(N \log^2 N / \log\log N)$ [68].
As a corollary of this result, the upper bound for algorithm X is no better than the
upper bound for algorithm W for the fail-stop no-restart model.

3.3.2 Algorithms Xcoin and Xdie

In  this  thesis  we  investigate  deterministic  solutions  and  concentrate  on  the  worst  case 
analysis.  Randomized  algorithms  for  the  Write-All  problem  are  capable  of  improving 
on  the  upper  bounds  of  the  deterministic  algorithms  only  when  the  adversary  is  off-line 
as  in  [75]  or  when  the  adversary  is  limited  probabilistically  as  in  [59,  61]. 

A randomized asynchronous coupon clipping (ACC) algorithm for the Write-All
problem was analyzed by Martel et al. in [75]. Assuming off-line adversaries, it was
shown in [75] that the ACC algorithm has expected work $O(N)$ using $P = N/(\log N \log^* N)$
processors on inputs of size N. However, when the adversary is on-line, the algorithm
becomes inefficient even for simple on-line adversaries.

Example  3.4  Stalking  adversary:  In  the  on-line  case,  we  observe  that  a  simple  stalking 
adversary causes the ACC algorithm to perform (expected) work of n( polylog A )
in the case of fail-stop errors, and Q (( ^ j work in the case of fail-stop
errors with restart even when using P < p^iy^g^v processors.

Algorithm  ACC  uses  as  its  main  data  structure  a  full  binary  tree  similar  to  the 
progress  tree  of  algorithm  X.  In  algorithm  ACC,  processors  randomly  visit  the  nodes 
of  that  binary  tree.  The  stalking  adversary  strategy  consists  of  choosing  a  single  leaf  in 
the  tree  employed  by  ACC,  and  failing  all  processors  that  touch  that  leaf  until  only  one 
processor  remains  in  the  fail-stop  case,  or  until  all  processors  simultaneously  touch  the 
leaf  in  the  fail-stop/restart  case.  This  performance  is  not  improved  even  when  using  the 
completed  work  accounting.  On  a  positive  note,  when  the  adversary  is  made  off-line, 
the  ACC  algorithm  becomes  efficient  in  the  fail-stop/restart  setting. 

Stalking adversaries can and do occur in practice. Consider a processor or processors
that repeatedly attempt to read from a “bad” memory location, or a processor that
executes an instruction whose microcode is partially corrupt; all such processors may
be failing at the same instruction either indefinitely, or in an intermittent pattern. □

Here we present two randomized versions of algorithm X, and we conjecture that
these algorithms have the potential of improving on the performance of algorithm X even
when the adversary is the worst case on-line adversary. In particular, neither algorithm
allows the stalking adversary to cause a quadratic or worse amount of work.


Algorithm Xcoin

This algorithm uses the data structures of algorithm X and is identical to algorithm
X, except that in line 09 of Figure 3.3, instead of making a move according to the PID
bits, a processor chooses to move left or right by flipping a coin.

The potential advantage of this randomization is that if several processors
find themselves at the same interior node, then, should they all survive the next step,
about half of them are expected to move right and half left, provided the corresponding
subtrees are not completed.

This strategy might yield better expected performance than algorithm X, because
the processors from the same subtrees will fan out to the uncompleted portion of the
progress tree sooner.

Example 3.5 Inefficient progress tree traversal: In Figure 3.8, let $T_l$ and $T_r$ be subtrees
of height h that are identically positioned within the left and right subtrees of
the progress tree, respectively. Suppose that, in either algorithm X or algorithm Xcoin, all p
processors from the subtree $T_l$ reach the root, and the right subtree is not completed.

In algorithm X the processors might synchronously reach the root of subtree $T_r$ at
height h without fanning out throughout the right subtree, because the processors
make identical down-moves along the path of length $\log N - h$ from the root: having
originated in the same subtree, they use identical PID bits along the path (N is the
number of leaves).

However, in algorithm Xcoin the processors will probabilistically fan out at the root
of the right subtree. Thus, instead of all p processors being at the leaves of tree $T_r$, the
processors will be balanced (probabilistically) throughout the leaves of the entire right
subtree. □

Algorithm  Xdie 

This algorithm uses the data structures of algorithm X and is similar to algorithm
X, with the following two differences:

1.  Instead  of  using  binary  values  at  the  progress  tree  nodes  indicating  whether  the 
subtree  is  completed  or  not,  algorithm  Xdie  uses  the  integer  values  at  the  nodes 
to  represent  the  known  number  of  descendent  leaves  visited  by  the  algorithm  as 
done  in  algorithms  V  and  W. 


2. Instead of making a move according to the PID bits (line 09 of Figure 3.3), a
processor at an interior progress tree node casts an N-sided die to produce a
random number r in the range [1, N], reads the values $d_1$ and $d_2$ of the left and
right children of the progress tree node, and then:

• if $d_1 = d_2 = 0$, then the processor moves left or right with probability 1/2 each;

• if $d_1 + d_2 \neq 0$, then the processor moves left if
$1 \le r \le N \cdot \frac{2^{h-1} - d_1}{2^h - d_1 - d_2}$, where h
is the height of the interior progress tree node, and it moves right otherwise;
this is done to send processors left or right with probabilities that are
proportional to the remaining work in the left ($2^{h-1} - d_1$) and right ($2^{h-1} - d_2$)
subtrees. (This is similar to the deterministic divide and conquer strategy
used in algorithm W.)
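The move rule of item 2 can be sketched as follows. This is a minimal illustration and not the thesis's code: the function name `die_move`, the integer comparison that replaces the fraction, and the injectable `rng` argument are assumptions made for the example. It assumes a node of height h has $2^{h-1}$ leaves under each child.

```python
import random

def die_move(h, d1, d2, n, rng=random):
    """Pick 'left' or 'right' at an interior node of height h.

    d1, d2: known counts of visited leaves under the left and right child.
    n: number of leaves N of the progress tree (the die has N sides).
    """
    half = 2 ** (h - 1)            # leaves under each child (assumed convention)
    rem_left = half - d1           # remaining work on the left
    rem_right = half - d2          # remaining work on the right
    if rem_left + rem_right == 0:
        raise ValueError("subtree is already complete")
    r = rng.randint(1, n)          # cast the N-sided die, 1..N inclusive
    # Move left iff r <= N * rem_left / (rem_left + rem_right),
    # compared in integers to avoid rounding.
    return "left" if r * (rem_left + rem_right) <= n * rem_left else "right"
```

With $d_1 = d_2 = 0$ the threshold is N/2, so the first bullet is a special case of the second in this sketch. Because the die has only N sides, the probabilities are proportional to the remaining work only up to a granularity of 1/N.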

This algorithm benefits from the earlier fan-out of the processors, just as in algorithm
Xcoin. In addition, in algorithm Xdie the processors take advantage of the
knowledge of the remaining amount of work in the left and right subtree at any node
of the progress tree, and attempt to move down to the leaves in numbers that are
proportional to the remaining work in the subtrees.

We conjecture that the expected work of both algorithms Xcoin and Xdie is better
than the worst case work of algorithm X subject to the worst case on-line adversary.


3.4  Hashed  Allocation  Paradigm 

The  final  technique  is  demonstrated  using  a  new  heuristic  for  determinizing  an  efficient 
randomized  Write-All  solution  proposed  by  Anderson  and  Woll  in  [8]. 

3.4.1  Algorithm  Y 

A family of randomized algorithms for Write-All was presented in [8]. The basic technique
in all of these algorithms is abstracted and given as high-level code in Figure 3.9.
The basic algorithm in [8] is obtained by randomly choosing the permutation in line 03.
In this case the expected work of the algorithm is $O(N \log N)$, for $P = \sqrt{N}$ (assume N
is a square).

We propose the following way of determinizing the algorithm of [8]: Given $P = \sqrt{N}$,
we choose the smallest prime m such that P < m. Primes are sufficiently dense, so that

01 forall processors PID = 1..√N parbegin
02   Divide the N array elements into √N work groups of √N elements
03   Each processor PID obtains a private permutation π_PID of {1, 2, ..., √N}
04   for i = 1..√N do
05     if π_PID[i]th group is not finished
06     then perform sequential work on the √N elements of the group
07          and mark the group as finished
08     fi
09   od
10 parend

Figure  3.9:  A  high  level  view  of  the  algorithm  Y  -  hashed  allocation  paradigm. 


there is at least one prime between P and 2P (e.g., see the discussion in [32, Sec. 33.8]),
so that the complexity of the algorithm is not distorted when P is not a prime. We
then construct the multiplication table for the numbers 1, 2, ..., m - 1 modulo m. It is
not difficult to show that each row of this table is a permutation and that this structure
is a group (using basic group theory facts, e.g., [23]). The processor with PID i uses the
ith permutation as its schedule.

Note that the table need not be pre-computed, as any item can be computed directly
by any processor from its PID and the number w of work elements it
has processed thus far, as (PID · w) mod m. Detailed pseudo-code for the deterministic
algorithm Y is given in Figure 3.10.
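The schedule computation described above is easy to check on small cases. The following sketch is illustrative only; the function names `smallest_prime_above` and `schedule` are hypothetical, and trial division is used purely for brevity.

```python
def smallest_prime_above(p):
    """Smallest prime m with p < m (trial division; adequate for a sketch)."""
    m = p + 1
    while any(m % d == 0 for d in range(2, int(m ** 0.5) + 1)):
        m += 1
    return m

def schedule(pid, m):
    """Row `pid` of the multiplication table modulo the prime m:
    the w-th entry is (pid * w) mod m, for w = 1..m-1."""
    return [(pid * w) % m for w in range(1, m)]

if __name__ == "__main__":
    m = smallest_prime_above(4)          # P = 4 processors
    for pid in range(1, m):
        print(pid, schedule(pid, m))     # every row is a permutation of 1..m-1
```

Each entry is computed from the PID and the counter w alone, so no processor needs to store the table.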

We  conjecture  that  the  worst  case  work  of  this  deterministic  algorithm  is  no  worse 
than  the  expected  work  of  the  randomized  algorithm. 

The  open  problem  below  contains  an  interesting  observation  of  a  group-theoretic 
aspect  of  a  multi-processor  scheduling  problem  [93]. 

An open problem: What is the completed work of algorithm Y with the proposed
determinization? We have performed some experimental analysis, and in all cases
the resulting work was $O(N \log N)$. This is the same as the expected work
using random permutations. We next briefly state the relevant framework, and then
state the open problem.


In  [8],  the  analysis  of  the  randomized  algorithm  is  based  on  the  definition  of  a 
measure,  called  contention,  that  evaluates  the  “overlap”  of  a  permutation  representing 
a processor schedule with respect to a permutation representing a scheduling adversary:

01 forall processors PID = 1..P = √N parbegin
02   -- The array elements are viewed as divided into √N work groups
03   -- of √N elements each: x[1..√N], x[1+√N..2√N], etc.
04   shared x[1..N];              -- shared memory
05   shared done[1..√N];          -- done markers
06   shared w[1..√N];             -- groups done by each processor
07   private k;                   -- local workgroup number per processor
08   w[PID] := 1;                 -- the initial workgroup to do
09   while w[PID] < m do          -- while not all √N groups done
10     k := (PID · w[PID]) mod m; -- current workgroup number
11     if not done[k]             -- group is not marked finished
12     then                       -- perform sequential work on the √N elements of the group
13       for i = 1..√N do
14         x[(k - 1)·√N + i] := 1;
15       od
16       done[k] := true;         -- mark workgroup as finished
17     fi
18     w[PID] := w[PID] + 1;      -- advance processor's groups done counter
19   od
20 parend

Figure  3.10:  A  detailed  view  of  the  deterministic  algorithm  Y. 


Definition 3.3 [8] Given two permutations $\pi$ and $\alpha$ represented as lists of integers,
the contention of $\pi$ with respect to $\alpha$, denoted $C(\pi, \alpha)$, is defined as follows:
scan $\alpha$ left-to-right, and for each encountered item, delete that item from $\pi$; $C(\pi, \alpha)$ is
the number of times the deleted item is at the head of the list $\pi$.


For example, C((2, 4, 1, 3), (1, 2, 3, 4)) = 2, and C((3, 2, 1), (2, 1, 3)) = 1.
Contention for a set of permutations is defined as follows:

Definition 3.4 [8] Given a set of permutations $\Pi = \{\pi_1, \ldots, \pi_n\}$ and a permutation $\alpha$,
$C(\Pi, \alpha)$ is defined as $\sum_{i=1}^{n} C(\pi_i, \alpha)$.

Intuitively, contention measures the redundant work performed by an algorithm that
uses the hashed allocation paradigm. The key relevant results shown in [8] are:

Lemma 3.19 [8] For permutations $\pi$ and $\alpha$, $C(\pi, \alpha)$ is equal to the number of
left-to-right maxima in $\alpha^{-1} \circ \pi$, where $\circ$ is the permutation composition.

Lemma 3.20 [8] If in an algorithm Y with p processors, each processor uses a permutation
from $\Pi = \{\pi_1, \ldots, \pi_p\}$ as its schedule, then the worst case work of the algorithm
is $O(p \cdot \max_\alpha C(\Pi, \alpha))$.
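The contention measure and its characterization by left-to-right maxima can be checked mechanically on small permutations. The sketch below is illustrative; the function names are hypothetical, and it assumes the composition convention that entry i of $\alpha^{-1} \circ \pi$ is the (1-based) position of $\pi(i)$ in the list $\alpha$.

```python
from itertools import permutations

def contention(pi, alpha):
    """C(pi, alpha): scan alpha, delete each item from pi, and count the
    deletions in which the item was at the head of pi."""
    remaining = list(pi)
    count = 0
    for item in alpha:
        if remaining[0] == item:
            count += 1
        remaining.remove(item)
    return count

def ltr_maxima(seq):
    """Number of left-to-right maxima of a sequence of positive integers."""
    best, count = 0, 0
    for v in seq:
        if v > best:
            best, count = v, count + 1
    return count

def inv_compose(alpha, pi):
    """alpha^{-1} o pi as a list: position (1-based) of each pi[i] in alpha."""
    pos = {v: i + 1 for i, v in enumerate(alpha)}
    return [pos[v] for v in pi]

if __name__ == "__main__":
    perms = list(permutations(range(1, 5)))
    agree = all(contention(p, a) == ltr_maxima(inv_compose(a, p))
                for p in perms for a in perms)
    print("characterization holds on all pairs of size 4:", agree)
```

The two worked examples from the text, C((2, 4, 1, 3), (1, 2, 3, 4)) and C((3, 2, 1), (2, 1, 3)), can be evaluated directly with `contention`.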

When we consider the group of all permutations of the integers {1, ..., p} with the
permutation composition $\circ$ being the multiplication operation for the group, the following
corollary follows:

Corollary 3.21 For any two permutations $\alpha$, $\beta$: $C(\beta, \alpha) = C(\alpha^{-1} \circ \beta, e)$, where e is
the identity permutation.

We now use the deterministic algorithm Y with the permutations being computed
deterministically  as  we  proposed,  and  reduce  our  conjecture  to  a  problem  below  that 
exhibits  an  interesting  connection  between  multiprocessor  scheduling  on  one  hand  and 
group  theory  and  combinatorics  on  the  other. 

The P permutations that are computed by the processors constitute a group. Call
it $\Pi$. Using the above corollary, we observe that the contention of the set of P
permutations with respect to any permutation $\alpha$ is $C(\Pi, \alpha)$, and it is the same as the
contention of the left coset $\alpha^{-1} \circ \Pi$ of $\Pi$ with respect to the identity permutation, that
is, $C(\alpha^{-1} \circ \Pi, e)$. This allows us to reduce the problem to the following (the details are
left as an exercise).
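For very small primes the reduced quantity can be computed by brute force. The following sketch is only an illustration of the reduction, not evidence for the conjecture: the names `table_rows`, `coset_sum` and `worst_coset_sum` are hypothetical, and the enumeration over all (m - 1)! adversary permutations is feasible only for tiny m.

```python
from itertools import permutations

def ltr_maxima(seq):
    """Number of left-to-right maxima of a sequence of positive integers."""
    best, count = 0, 0
    for v in seq:
        if v > best:
            best, count = v, count + 1
    return count

def table_rows(m):
    """Rows of the multiplication table of {1..m-1} modulo the prime m."""
    return [[(a * w) % m for w in range(1, m)] for a in range(1, m)]

def coset_sum(rows, alpha):
    """Sum of left-to-right maxima over the coset obtained by relabelling
    every row through alpha^{-1} (position of each value in alpha)."""
    pos = {v: i + 1 for i, v in enumerate(alpha)}
    return sum(ltr_maxima([pos[v] for v in row]) for row in rows)

def worst_coset_sum(m):
    """Maximum of coset_sum over all adversary permutations alpha."""
    rows = table_rows(m)
    return max(coset_sum(rows, alpha) for alpha in permutations(range(1, m)))

if __name__ == "__main__":
    for m in (3, 5, 7):
        print(m, worst_coset_sum(m))
```

Under the characterization of contention by left-to-right maxima, `coset_sum(rows, alpha)` equals the contention of the whole schedule set with respect to `alpha`.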

In order to show that the worst case work of Y is $O(N \log N)$, using the above
framework, it is sufficient to show the following:

Given a prime m, consider the group G = ({1, 2, ..., m - 1}, · (mod m)).
The multiplication table for G, when the rows of the table are interpreted
as permutations of {1, ..., m - 1}, is a group K of order m - 1 (a subgroup
of all permutations). Show that, for each left coset of K (with respect to all
permutations), the sum of the number of left-to-right maxima of all elements
of the coset is $O(m \log m)$.

Chapter 4

Write-All Lower Bounds With
Memory Snapshots

Adversaries considered in this work are very powerful: no optimal N-processor
solutions exist for the Write-All problem even if the processors have the ability
to take instant memory snapshots.

In this chapter we show that for any algorithm that implements an N-processor robust
solution to the Write-All problem in either the no-restart fail-stop or the restartable
fail-stop model, a failure pattern can be constructed that will cause the algorithm to perform
a superlinear number of processing steps. That is, there are no work-optimal
N-processor solutions in these models (in contrast, optimality can be achieved by
exploiting parallel slack, i.e., using fewer than N processors).

The  lower  bound  results  apply  to  the  worst  case  work  of  the  deterministic  algorithms, 
and  the  expected  work  of  deterministic  and  randomized  algorithms  that  are  subject  to 
dynamic  on-line  adversaries.  These  results  hold  even  under  the  additional  assumption 
that  processors  can  read  and  locally  process  all  the  shared  memory  at  unit  cost.  We 
also show that concurrent writes are necessary for the existence of robust algorithms;
concurrent writes are an important source of redundancy in our approach.

4.1 Lower Bounds for the No-Restart Fail-Stop Model

We  now  present  a  lower  bound  that  holds  for  the  fail-stop  PRAMs  as  well  as  for  a 
much  stronger  model  where  the  processors  can  take  unit  time  memory  snapshots,  i.e., 
processors  can  read  and  locally  process  the  entire  shared  memory  at  unit  cost. 

We first list three simple mathematical lemmas (whose proofs are given in
Appendix A), then state the main lower bound theorem and its proof.


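Lemma 4.1 Given a sorted list of m (m > 1) nonnegative integers $a_1, a_2, \ldots, a_m$, then
we have for all j ($1 \le j < m$) that

$\left(1 - \frac{j}{m}\right) \sum_{i=1}^{m} a_i \le \sum_{i=j+1}^{m} a_i$.


Lemma 4.2 Given G > 1, N > G, and an integer $\sigma$ such that $\sigma < \frac{\log N}{\log G} - 1$, then
the following inequality holds:

$\underbrace{\lfloor \ldots \lfloor \lfloor N/G \rfloor / G \rfloor \ldots / G \rfloor}_{\sigma \text{ times}} > 0$

(where $\sigma$ is the number of divisions by G).


Lemma 4.3 For $N \to \infty$:

$\left(1 - \frac{1}{\log N}\right)^{\frac{\log N}{\log\log N}} = 1 - \frac{1}{\log\log N} + O\left(\frac{1}{(\log\log N)^2}\right)$.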
In the theorem, we will make use of Lemma 4.2 with G = log N.


Theorem 4.4 Given any (deterministic or randomized) N-processor CRCW PRAM
algorithm that solves the Write-All problem, the adversary can force fail-stop errors that
result in $\Omega(N \frac{\log N}{\log\log N})$ steps being performed by the algorithm, even if the processors
can read and locally process all shared memory at unit cost.


Proof: We present a strategy for the adversary that results in this worst
case behavior. Let A be the best possible algorithm that implements a robust solution
for the Write-All problem. Each processor participating in the algorithm is allowed to
read the entire shared memory, and locally perform arbitrary computation on it in unit
time.

Let $P_0 = N$ be the initial number of processors, and $U_0 = N$ be the initial number of
unvisited array elements. The strategy of the adversary is outlined below. Step numbers
refer to the PRAM steps (not to be confused with block-steps or loop-iterations used
in Section 3.2.1). For each step, the adversary determines which processors are
going to write to which shared memory locations.

Step 1: The adversary chooses the $U_1 = \lfloor U_0/\log U_0 \rfloor$ array elements with the least number
of processors assigned to them. This can be done since the adversary knows all the
actions to be performed by A. The adversary then fail-stops the processors assigned to
these array elements, if any.

To estimate the number of surviving processors,
we use Lemma 4.1 with the following definitions:

Let $m = U_0$, and let $a_1, \ldots, a_m$ be the quantities of processors assigned to each
array element, sorted in ascending order; moreover, let $a_m$ also include the quantity of any
unassigned processors (i.e., $a_1$ is the least number of processors assigned to an array
element, $a_2$ is the next least quantity of processors, etc.). Let $j = U_1$. Thus the adversary
failed exactly $\sum_{i=1}^{j} a_i$ processors. The initial number of processors is $\sum_{i=1}^{m} a_i = P_0$;
therefore, the number of surviving processors $P_1$ is $\sum_{i=j+1}^{m} a_i = P_1$. Using Lemma 4.1,
we get:

$P_1 \ge (1 - U_1/U_0) P_0$

or, after substituting for $U_1$ and using the properties of floor:

$P_1 \ge \left(1 - \frac{\lfloor U_0/\log U_0 \rfloor}{U_0}\right) P_0 \ge \left(1 - \frac{1}{\log U_0}\right) P_0$.

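Step 2: The adversary again chooses, among the $U_1$ remaining unvisited array elements,
the $U_2 = \lfloor U_1/\log U_0 \rfloor$ elements with the least number of processors assigned to them. Using
Lemma 4.1 again in a similar way:

$P_2 \ge \left(1 - \frac{U_2}{U_1}\right) P_1 = \left(1 - \frac{\lfloor U_1/\log U_0 \rfloor}{U_1}\right) P_1 = \left(1 - \frac{\lfloor \lfloor U_0/\log U_0 \rfloor / \log U_0 \rfloor}{\lfloor U_0/\log U_0 \rfloor}\right) P_1$

$\ge \left(1 - \frac{\lfloor U_0/\log U_0 \rfloor / \log U_0}{\lfloor U_0/\log U_0 \rfloor}\right) P_1 = \left(1 - \frac{1}{\log U_0}\right) P_1 \ge \left(1 - \frac{1}{\log U_0}\right)^2 P_0$.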

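Step i: The adversary chooses, among the $U_{i-1}$ unvisited array elements, the
$U_i = \lfloor U_{i-1}/\log U_0 \rfloor$ elements with the least number of processors assigned to them.
Again, applying Lemma 4.1:

$P_i \ge \left(1 - \frac{U_i}{U_{i-1}}\right) P_{i-1} \ge \left(1 - \frac{1}{\log U_0}\right) P_{i-1} \ge \left(1 - \frac{1}{\log U_0}\right)^i P_0$.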

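This process is repeated for as long as there are any unvisited array elements, at
which point the surviving processors will successfully terminate the algorithm. Let $\rho$
be the step at which the last unvisited element is finally visited. Let us use Lemma
4.2 with $G = \log N$ and $\sigma$ the largest integer such that $\sigma < \log N/\log\log N - 1$. Then

$U_\sigma = \underbrace{\lfloor \ldots \lfloor \lfloor N/G \rfloor / G \rfloor \ldots / G \rfloor}_{\sigma \text{ times}} > 0$,

and so $\rho$ must be greater than $\sigma$ because $U_\rho = 0$. Thus we have

$\rho \ge \frac{\log U_0}{\log\log U_0} - 1 = \frac{\log N}{\log\log N} - 1 > \sigma$.

We want to estimate $S = \sum_{i=0}^{\rho} P_i$. By the adversary strategy given above, for all
PRAM steps i: $P_i \ge \left(1 - \frac{1}{\log N}\right)^i P_0$. Therefore:

$S = \sum_{i=0}^{\rho} P_i \ge \sum_{i=0}^{\rho} \left(1 - \frac{1}{\log N}\right)^i P_0$.

Using the summation of a geometric progression we obtain:

$S \ge P_0 \log N \left(1 - \left(1 - \frac{1}{\log N}\right)^{\rho+1}\right) \ge P_0 \log N \left(1 - \left(1 - \frac{1}{\log N}\right)^{\frac{\log N}{\log\log N}}\right)$.

Using the result of Lemma 4.3,
$\left(1 - \frac{1}{\log N}\right)^{\frac{\log N}{\log\log N}} = 1 - \frac{1}{\log\log N} + O\left(\frac{1}{(\log\log N)^2}\right)$, we
obtain the following lower bound on the number of PRAM processor steps:

$S \ge P_0 \log N \left(\frac{1}{\log\log N} - O\left(\frac{1}{(\log\log N)^2}\right)\right) = N \frac{\log N}{\log\log N} \left(1 - O\left(\frac{1}{\log\log N}\right)\right)$.

Therefore $S = \Omega\left(N \frac{\log N}{\log\log N}\right)$. □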
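The quantities in the proof can be traced numerically for a concrete N. The sketch below is an illustration under stated assumptions, not part of the proof: base-2 logarithms are assumed, the function name `adversary_trace` is hypothetical, and the returned bound simply sums the per-step guarantees $N (1 - 1/\log N)^i$ over the steps 0 through the last one.

```python
import math

def adversary_trace(n):
    """Follow U_i = floor(U_{i-1} / log2 N), starting from U_0 = N, until no
    unvisited elements remain.

    Returns (rho, bound): rho is the number of steps taken, and bound is the
    sum over i = 0..rho of N * (1 - 1/log2 N)**i, the guaranteed number of
    surviving processor steps.
    """
    g = math.log2(n)
    u, rho = n, 0
    while u > 0:
        u = math.floor(u / g)      # the adversary keeps a 1/log N fraction unvisited
        rho += 1
    bound = sum(n * (1 - 1 / g) ** i for i in range(rho + 1))
    return rho, bound

if __name__ == "__main__":
    n = 2 ** 16
    rho, bound = adversary_trace(n)
    print(rho, bound, n * math.log2(n) / math.log2(math.log2(n)))
```

The last printed value is $N \log N / \log\log N$, which gives a sense of scale for the guaranteed work at this N.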


We use this result in Section 3.2.1 to exhibit a processor failure pattern for algorithm
W that results in the worst case behavior of algorithm W, corresponding to work
$S = \Theta(N \log^2 N / \log\log N)$.


Remark 4.1 The lower bound of $\Omega(N \frac{\log N}{\log\log N})$ is the strongest possible bound for the
fail-stop model without restarts under the memory snapshot assumption. This can be
shown in a straightforward way by adapting the analysis of algorithm W by Martel [71]
(Theorem 3.4bis in Appendix A.6). According to the analysis, the number of “block-steps”
of W for P = N is $O(N \log N / \log\log N)$, and each block-step can be realized at
unit cost under the memory snapshot assumption.


We  close  this  section  with  a  comment  on  the  knowledge  of  the  adversary  used  in  the 
proofs.  The  adversary  dynamically  fails  processors  based  solely  on  their  intent  to  write 
into the array to be initialized by the algorithm. The adversary uses no knowledge
about the methods that the algorithms use in selecting the array elements to write
into.  Therefore  the  worst  case  behavior  caused  by  such  adversary  will  also  apply  to  the 
cases  where  the  algorithm  uses  a  stronger  PRAM  model  (such  as  the  ideal  PRAM  of 
Beame  and  Hastad  [20])  with  arbitrarily  powerful  instruction  set  or  the  cases  where  the 
algorithm  makes  probabilistic  decisions  (such  as  using  a  coin  toss). 

On  the  other  hand,  when  the  adversary  is  limited  as  in  Martel  et  al.  [75]  and  Kedem 
et  al.  [59]  by  using  off-line,  oblivious  adversaries,  or  the  adversaries  that  are  limited 
stochastically,  better  expected  work  has  been  reported  for  CRCW  PRAMs. 

4.2  Lower  Bounds  for  the  Restartable  Fail-Stop  Model 

As we have shown in Example 2.4 in Section 2.6, without the update cycle accounting
there is a thrashing adversary that exhibits a quadratic lower bound for the Write-All
problem in the restartable fail-stop model. When the update cycle accounting is
introduced, we showed in Section 3.3.1 that there is a sub-quadratic solution. With
the update cycle accounting we now show an $N - P + \Omega(P \log P)$ work lower bound (when
$P \le N$), even when the processors can take unit time memory snapshots, i.e., processors
can read and locally process the entire shared memory at unit cost.

Theorem 4.5 Given any P-processor CRCW PRAM algorithm that solves the Write-All
problem of size N ($P \le N$), an adversary (that can cause arbitrary processor failures
and restarts) can force the algorithm to perform $N - P + \Omega(P \log P)$ completed work
steps.

Proof: Let Z be any algorithm for the Write-All problem subject to arbitrary
failures/restarts using update cycles. Consider each PRAM cycle. The adversary uses the
following strategy:

Let $U \ge 1$ be the number of unvisited array elements. For as long as U > P, the
adversary induces no failures. The work needed to visit N - P array elements when
there are no failures is at least N - P.

As soon as a processor is about to visit the element N - P + 1, making U < P,
the adversary fails and then restarts all P processors. For the upcoming cycle, the
adversary determines how the algorithm assigns processors to write to array elements.
By an averaging argument, for any processor assignment to the U elements, there is a
set of $\lfloor U/2 \rfloor$ unvisited elements with no more than $\lceil P/2 \rceil$ processors assigned to them. The
adversary fails these processors, allowing all others to proceed. Therefore at least $\lfloor P/2 \rfloor$
processors will complete this step, having visited no more than half of the remaining
unvisited array locations.

This strategy can be continued for at least log P iterations. The work performed by
the algorithm will be $S \ge N - P + \lfloor P/2 \rfloor \log P = N - P + \Omega(P \log P)$. □

Note that the bound holds even if processors are only charged for writes into the
array of size N and are not restricted to writing the value 1. The simplicity of this strategy
ensures that the results hold in the strongly asynchronous model.
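The halving strategy of the adversary can be simulated against one concrete allocation rule. The sketch below is only an illustration: the round-robin assignment `pid % u` stands in for an arbitrary algorithm and is an assumption, as is the function name `halving_adversary`. It starts at the point where U = P elements remain and charges only the processors that complete a round.

```python
def halving_adversary(p):
    """Simulate the adversary once U <= P, against a round-robin allocation.

    Each round the adversary protects the floor(U/2) least-loaded unvisited
    elements by failing (and later restarting) the processors assigned to
    them; every other processor completes its write and is charged one step.
    Returns (rounds, completed_work).
    """
    u = p
    rounds, work = 0, 0
    while u > 0:
        load = [0] * u
        for pid in range(p):               # all p processors are live again
            load[pid % u] += 1             # assumed allocation: round-robin
        order = sorted(range(u), key=lambda e: load[e])
        protected = order[: u // 2]        # least-loaded half stays unvisited
        failed = sum(load[e] for e in protected)
        work += p - failed                 # only completed steps are charged
        u = len(protected)                 # all other elements were visited
        rounds += 1
    return rounds, work

if __name__ == "__main__":
    for p in (4, 16, 64, 256):
        print(p, halving_adversary(p))
```

Because the round-robin rule spreads processors evenly, each round fails roughly half of them, and the number of rounds grows with log P, matching the shape of the bound.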

Theorem 4.6 Any N-processor strongly asynchronous PRAM algorithm that solves
the Write-All problem of size N has total work $N - P + \Omega(P \log P)$.

Proof:  Any  possible  execution  of  an  algorithm  on  the  restartable  fail-stop  model  can 
be  duplicated  by  an  appropriate  interleaving  on  the  strongly  asynchronous  model.  The 
argument  in  Theorem  4.5  works  even  if  failed  processors  do  not  lose  local  state,  and  so 
the same strategy will work in the strongly asynchronous model. □

This  lower  bound  is  the  tightest  possible  bound  under  the  assumption  that  the 
processors  can  read  and  locally  process  the  entire  shared  memory  at  unit  cost.  Although 
such  an  assumption  is  very  strong,  we  present  the  matching  upper  bound  for  two  reasons. 
First,  it  demonstrates  that  any  improvement  to  the  lower  bound  must  take  account  of 
the  fact  that  processors  can  read  only  a  constant  number  of  cells  per  update  cycle. 
Second,  it  presents  a  simple  processor  allocation  strategy  that  we  use  to  advantage  in 
Section 3.2.2 when we analyze algorithm V.

Theorem 4.7 If processors can read and locally process the entire shared memory at
unit cost, then a solution for the Write-All problem in the restartable fail-stop model
can be constructed such that its completed work using P processors on input of size N
is $S = N - P + O(P \log P)$, when $P \le N$.

Proof: The processors use the following simple strategy: at each step that a processor
PID is active, it reads the N elements of the array x[1..N] to be visited. Say U of these
elements are still not visited. The processor numbers these U elements from 1 to U based
on their position in the array, and assigns itself to the i-th unvisited element such that
i = ⌈PID · U/P⌉. This achieves load balancing with no more than ⌈P/U⌉ processors assigned
to each unvisited element. The reading and local processing is done as a snapshot at
unit cost.
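The allocation rule can be checked on small cases. The following Python sketch is illustrative only: the names `assign` and `max_load` are ours rather than the thesis's, and the snapshot of unvisited cells is simply passed in as a list in array order.

```python
from collections import Counter


def assign(pid: int, unvisited: list[int], p: int) -> int:
    """Return the array position chosen by processor `pid` (1-based).

    `unvisited` is the snapshot of unvisited positions in array order and
    `p` is the number of processors. The rank is i = ceil(pid * U / p).
    """
    u = len(unvisited)
    i = (pid * u + p - 1) // p  # integer ceiling of pid * u / p
    return unvisited[i - 1]


def max_load(unvisited: list[int], p: int) -> int:
    """Largest number of processors that pick the same unvisited cell."""
    picks = Counter(assign(pid, unvisited, p) for pid in range(1, p + 1))
    return max(picks.values())


# 10 processors, 3 unvisited cells: no cell gets more than ceil(10/3) = 4.
load_small = max_load([2, 5, 9], 10)
# 4 processors, 8 unvisited cells: every processor gets its own cell.
load_large = max_load(list(range(1, 9)), 4)
```

Integer arithmetic is used for the ceiling so that the rank is exact for any P and U.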

We list the elements of the Write-All array in ascending order according to the
time at which the elements are visited (ties are broken arbitrarily). We divide this list
into adjacent segments numbered sequentially starting with 0, such that segment 0
contains V_0 = N − P elements, and segment j ≥ 1 contains V_j = ⌊P/(j(j+1))⌋ elements, for
j = 1,...,m and for some m ≤ √P. Let U_j be the least possible number of unvisited
elements when processors were being assigned to the elements of the j-th segment. U_j
can be computed as U_j = N − Σ_{i=0}^{j−1} V_i. U_0 is of course N, and for j ≥ 1,
U_j = P − Σ_{i=1}^{j−1} V_i ≥ P − (P − P/j) = P/j. Therefore no more than ⌈P/U_j⌉ ≤ j
processors were assigned to each element of the j-th segment.


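The work performed by such an algorithm is:

\[
S \;\le\; \sum_{j=0}^{m} V_j \left\lceil \frac{P}{U_j} \right\rceil
  \;\le\; V_0 + \sum_{j=1}^{m} \left\lfloor \frac{P}{j(j+1)} \right\rfloor j
  \;=\; V_0 + O\!\left( P \sum_{j=1}^{m} \frac{1}{j+1} \right)
  \;=\; N - P + O(P \log P).
\]

□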
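As a sanity check of the arithmetic in this bound, the sum can be evaluated directly. This is a hedged numeric sketch, not part of the proof: the function name `segment_work_bound` is ours, the segment sizes are taken to be ⌊P/(j(j+1))⌋ with m = ⌊√P⌋, and P ≤ N is assumed.

```python
from math import isqrt, log2


def segment_work_bound(n: int, p: int) -> int:
    """Evaluate V_0 + sum_{j=1..m} floor(p / (j (j + 1))) * j with m = isqrt(p)."""
    work = n - p  # segment 0: one processor per element
    for j in range(1, isqrt(p) + 1):
        work += (p // (j * (j + 1))) * j  # segment j: at most j processors per element
    return work


n, p = 4096, 1024
bound = segment_work_bound(n, p)
```

For these values the sum stays below N − P + P log₂ P, in line with the harmonic-sum estimate used above.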


Finally,  the  lower  bounds  in  this  section  apply,  just  as  the  results  for  the  non- 
restartable  model,  to  the  worst  case  work  of  the  deterministic  algorithms,  and  to  the 
expected  work  of  the  deterministic  and  the  randomized  algorithms  subject  to  the  worst 
case dynamic on-line adversaries (that cannot affect a coin toss or a random number
selection). 


4.3  Other  bounds 

4.3.1  A  lower  bound  for  CREW  PRAM 


In the absence of failures, any P-processor CREW (concurrent read exclusive write)
or EREW (exclusive read exclusive write) PRAM can simulate a P-processor CRCW
PRAM with only a factor of O(log P) more parallel work [58]. We now show that a
more severe difference exists between CRCW and CREW PRAMs (and thus also EREW
PRAMs) when the processors are subject to failures.

Theorem 4.8 Given any (deterministic or randomized) N-processor CREW PRAM
algorithm that solves the Write-All problem, the adversary can force fail-stop errors
that result in Ω(N²) steps being performed by the algorithm, even if the processors can
read and locally process all shared memory at unit cost.

Proof:  To  prove  this,  we  first  define  an  auxiliary  Write-One  problem  as  follows:  Given 
a  scalar  variable  s  whose  value  is  initially  0,  store  1  in  this  variable. 

Let B be the asymptotically most efficient CREW algorithm that solves the Write-One
problem and that is able to read and process all shared memory at unit cost. Such an
algorithm is no more efficient asymptotically than the algorithm below that utilizes an
oracle to predict the best selection of a processor that is chosen to exclusively write the
value 1:

forall processors PID=1..N parbegin
    shared integer s;
    while s = 0 do
        if PID = Oracle() then s := 1 fi
    od
parend

To exhibit the worst case behavior the adversary fails the processor that was selected
by Oracle() to perform the exclusive write, until a single processor remains. The
remaining processor is then allowed to write 1 by the adversary. Clearly S = Ω(N²).

Finally, it can be shown by a simple reduction to the Write-One problem that any
N-processor CREW solution to the Write-All problem has the worst case of S = Ω(N²).
□
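The adversary's strategy against the Write-One algorithm can be played out in a few lines. The Python sketch below is an illustration under stated assumptions: `write_one_work` is a name of ours, the oracle is replaced by an arbitrary deterministic choice (the adversary's response does not depend on which processor is chosen), and the failed processor's loop iteration is charged as work. Without that last charge the count drops by N − 1 and is still quadratic.

```python
def write_one_work(n: int) -> int:
    """Count loop iterations executed until s becomes 1 under the adversary."""
    alive = list(range(1, n + 1))
    work = 0
    while True:
        work += len(alive)        # every live processor runs one loop iteration
        chosen = alive[0]         # stand-in for Oracle(): any choice works
        if len(alive) > 1:
            alive.remove(chosen)  # adversary fails the chosen writer before it writes
        else:
            return work           # the last survivor is allowed to write 1


work_100 = write_one_work(100)
```

The count is N + (N − 1) + ... + 1 = N(N + 1)/2, which matches the Ω(N²) bound.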


For the CREW PRAMs, Martel and Subramonian show a randomized Write-All
algorithm in [73] that, when the adversary is oblivious and non-adaptive, has expected work
of only O(N log N) using N processors, and expected work O(N) when using N/log N
processors. Thus it appears that adaptive adversaries are significantly more powerful
than oblivious adversaries as far as randomized CREW algorithms are concerned.

4.3.2 Lower bounds with test-and-set operations

Under certain assumptions on the way that memory is accessed in the strongly
asynchronous model, a different lower bound is shown by Buss et al. in [27]. Assume that,
instead of atomic reads and writes, memory is accessed by means of test-and-set
operations. That is, memory can only contain zeroes and ones, and a single test-and-set
operation on a memory cell sets the value of that cell to 1 and returns the old value of
the cell.
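The memory model just described can be sketched as follows. This is a minimal single-threaded illustration: the class name `TasMemory` is ours, and atomicity is modeled only by the fact that each call runs to completion before the next one starts.

```python
class TasMemory:
    """Shared memory of 0/1 cells that is accessed only through test-and-set."""

    def __init__(self, size: int) -> None:
        self.cells = [0] * size

    def test_and_set(self, i: int) -> int:
        """Set cell i to 1 and return the value it held before."""
        old = self.cells[i]
        self.cells[i] = 1
        return old


mem = TasMemory(3)
first = mem.test_and_set(1)   # the cell was 0, so this caller did the visit
second = mem.test_and_set(1)  # the cell is already 1, so the visit was redundant
```

A returned 0 tells the caller that it was the first to visit the cell, which is the only information a processor obtains from memory in this model.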

Theorem 4.9 [27] Any strongly asynchronous PRAM algorithm for the Write-All
problem which uses test-and-set as an atomic operation requires N + Ω(P log(N/P))
total work, for P > 3.

This lower bound can be applied equally well if the atomic operation is
compare-and-swap, or to any set of atomic read-modify-write operations where the reads and
writes are constrained to be to the same cells. The result also applies to the fail-stop
restartable model, when each update cycle accesses only one array element used by the
Write-All problem.

Chapter 5

Algorithm  Simulations  and 
Transformations 


In this chapter we develop a general technique for efficient simulation of any PRAM
algorithm by a fail-stop CRCW PRAM using the Write-All paradigm. We formulate
a universal PRAM interpreter (UPI), then develop a fault-tolerant UPI that can execute
any PRAM algorithm by storing the programs and processor registers in shared memory.
Our simulation is based on executing individual PRAM computation steps using the
Write-All paradigm in such a way that the complexity of solving an N-size instance of
the Write-All problem using P fail-stop processors, and the complexity of executing a
single N-processor PRAM step on a fail-stop P-processor PRAM are equal.

The  fault-tolerant  algorithms  are  executed  on  CRCW  PRAMs  whose  processors  are 
subject  to  fail-stop  errors.  As  we  have  shown  in  Chapter  4  on  lower  bounds,  the  CREW 
(concurrent  read,  exclusive  write)  model  is  not  sufficient  due  to  the  fact  that  very  simple 
adversaries  can  cause  at  least  a  quadratic  amount  of  work  for  any  Write-All  solution, 
even  if  the  restarts  are  not  allowed. 

We will show that in the no-restart fail-stop model, algorithms for the following
models can be made fault-tolerant: WEAK (only zeros can be written concurrently),
COMMON (concurrent writes of identical values are permitted), ARBITRARY
(some single processor succeeds), PRIORITY (highest numbered processor succeeds), and
STRONG (processor writing the largest value succeeds). In the restartable model, we
can make fault-tolerant all of the above, except for the PRIORITY PRAM (for detailed
surveys of PRAM variations see [40, 58]).

For the no-restart model, we show that the simulation of PRAM algorithms can be
done using optimal work in the presence of arbitrary fail-stop errors. This is
accomplished using our simulation together with algorithm W that is optimal on inputs of
size N when using P processors such that W is in its range of optimality. The resulting
fault-tolerant execution does not degrade the asymptotic efficiency of the source
algorithm. For the restartable model we show a strategy that is work-optimal when the
number of simulating processors P is within the optimality range of algorithm V and
the total number of failures per each simulated N-processor step is O(P log N).

We  also  show  that  in  some  cases  it  is  possible  to  develop  fault-tolerant  algorithms 
that  improve  on  the  efficiency  of  general  but  “naive”  simulations  of  these  algorithms. 

Finally, we briefly discuss some parallel efficiency classes and “closures” with respect
to fault tolerance.


5.1  General  Parallel  Assignment 

In  this  section  we  introduce  the  basic  techniques  that  are  going  to  be  used  in  the  PRAM 
simulations. 

Given a solution for the Write-All problem, it can readily be used as a building block
for transforming efficient parallel algorithms into robust ones. The techniques used are
illustrated by producing a robust general parallel assignment in the following example:

Example 5.1 General parallel assignment: Consider computing and storing in an array
x[1..N] the values of a function f that depend on the processor numbers PID and the
initial values of the array x. Also, for simplicity, assume f can be computed in O(1)
sequential time.

forall processors PID = 1..N parbegin
    shared integer array x[1..N];
    x[PID] := f(PID, x[1..N])
parend

We convert the assignment to a form that remains correct when processors fail and
when multiple attempts are made to execute the assignment, assuming there are means
for reassigning surviving processors that have accomplished their initial task. This is
done using binary version numbers and two generations of the array:

forall processors PID = 1..N parbegin
    shared integer array x[0..1][1..N];
    bit integer v;
    x[v + 1][PID] := f(PID, x[v][1..N]);
    v := v + 1
parend

Here, v is the current bit (mod 2) version number (or tag), so that x[v][1..N] is
the array of current values. Function f will use only these values of x as its input. The
values of f are stored in x[v + 1][1..N], creating the next generation of array x. After
all the assignments are performed, the binary version number is incremented (mod 2).

At this point, a simple transformation of a solution to the Write-All problem, with
the general parallel assignment replacing the trivial “x[i] = 1” assignment, will yield a
robust N-processor algorithm. □
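The two-generation assignment can be exercised directly to see why repeated attempts are harmless. The Python sketch below is an illustration, not the thesis's algorithm: `robust_assign` and its `attempts` list (the order in which simulated assignments happen to be executed, with repeats standing in for re-execution after failures) are assumptions of ours, and the Write-All machinery that would produce such an order is left out.

```python
def robust_assign(x, v, f, attempts):
    """Run the two-generation assignment and return the new version bit.

    x        -- two generations, x[0] and x[1], each a list of N values
    v        -- current version bit (0 or 1)
    f        -- f(pid, current_values) with 1-based pid
    attempts -- PIDs in the order the work is attempted; repeats model
                re-execution after failures, and every PID must appear
    """
    nxt = (v + 1) % 2
    for pid in attempts:
        x[nxt][pid - 1] = f(pid, x[v])  # reads only the current generation
    return nxt  # the version bit is flipped once, after all assignments


x = [[1, 2, 3], [0, 0, 0]]
v = robust_assign(x, 0, lambda pid, cur: sum(cur) + pid, [1, 2, 2, 3, 1])
```

Because f reads only generation v, executing the assignment for PID 1 or PID 2 a second time stores the same value again. An in-place update of a single array would not have this property.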

The  preceding  example  directly  yields  the  following: 

Proposition 5.1 The asymptotic work complexities of solving the general parallel
assignment problem and the Write-All problem are equal. □

Similarly to the general parallel assignment, any Write-All solution can be directly
used to zero an array of size N, copy N computed values from one array to another, or
perform any other single step PRAM computations.

Write-All algorithms usually assume that an amount of shared memory proportional
to either the number of processors P or the problem size N is available, and that it is
initialized to zero (we relax this assumption in Section 6.1). This is true of all known
Write-All solutions [8, 27, 55, 56, 59, 61, 75], and this also applies to algorithms that
can be adapted to serve as a Write-All solution, e.g., [29]. Furthermore, iterative use
of the Write-All technique in a single algorithm requires that the heaps contain zeroes
at the start of each iteration. This can be accomplished by utilizing three identical sets
of the heaps. One is used in the main algorithm, another set of heaps is used for zeroing
the other two in a Write-All style, and the third set is saved for the next use of the
zeroing clean-up step. This step does not affect the asymptotic efficiency, and we will
assume that the heaps are appropriately cleared in the rest of this section. We are going
to revisit this technique in more detail in the next two sections.

5.2  A  PRAM  Interpreter 

PRAM programs are normally presented as high level programs that can be compiled
into assembly level instructions using conventional techniques (see a discussion in
Wyllie’s thesis [102]). As is the case with sequential processors, the instructions are stored
in memory, the address of an instruction to be executed next is stored in an instruction
counter register, and in order to execute an instruction, it is fetched into an instruction
buffer. When control structures such as while-do and if-then-else are used, the
branching of control is compiled as assignments to instruction counters. Other processor
private memory cells are stored in general-purpose registers. Here we will use a
formalization of the definition of PRAM such as the one used by Karp and Ramachandran in
[58]. Informally, PRAM instructions consist of three synchronous cycles:

1.  Read  cycle:  a  processor  reads  a  value  from  a  location  in  shared  memory  into 
private  memory, 

2.  Compute  cycle:  a  processor  performs  a  computation  using  private  memory, 

3.  Write  cycle:  a  processor  writes  a  value  from  a  location  in  private  memory  to  a 
location  in  shared  memory. 

To  formalize  PRAM  programs  that  are  specified  in  terms  of  a  PRAM  “machine 
language”,  we  formulate  a  definition  given  in  Figure  5.1  in  conjunction  with  the  code 
for  a  PRAM  interpreter.  The  simple  PRAM  program  in  Definition  5.1  in  the  figure 
implements  the  synchronous  computation  performed  by  a  PRAM. 

The number of registers per processor, l = |r|, is typically constant for uni-processors;
however, in parallel processing it is important to provide each processor with larger
private memories in order to allow them to perform as much computation as possible
without having to access the shared memory. We will consider private memories of sizes
up to O(log^k N) for some constant k. This does not diminish the computational power
of the model, since if an algorithm assumes a larger than available private memory, a
dedicated portion of shared memory can be used by such algorithms.

01 forall processors PID=1..P parbegin
02   shared SM[1..m];                      -- shared memory
03   shared PROG[1..P,1..size];            -- P programs of length size
04   private IB                            -- instruction buffer
05   private r record IC, RR, WR, ... end  -- private registers
06   r.IC := 1;                            -- start at the first instruction
07   while r.IC ≠ 0 do                     -- while not HALT
08     IB := PROG[PID,r.IC];               -- fetch of <R() C() W() J()> into IB
09     r.RR := SM[R(r)];                   -- read cycle
10     r := C(r);                          -- compute cycle
11     SM[W(r)] := r.WR;                   -- write cycle
12     r.IC := J(r);                       -- next instruction
13   od
14 parend


Definition  5.1  A  P-processor  PRAM  program  on  inputs  of  size  N  is  defined  as  follows: 


1. The P processors have unique identifiers PID in the range 1 to P.

2. Each processor has an instruction buffer IB, and a set of internal registers
collectively referred to as the record r. The number of registers per processor, |r|,
is l. The registers in r include an instruction counter IC, a read register RR and
a write register WR used for reads/writes from/to shared memory.

3. Q uses shared memory cells SM[1..m] for some m, with the first N cells
containing the input. Shared memory cells and registers are capable of storing
O(log max{N, P}) bits each.

4. Program instructions are stored in a shared array PROG[1..P,1..size], where size
is a constant. The program for processor i is in PROG[i,1..size], one instruction per
array element.

5. Instructions consist of four encoded operations <R() C() W() J()>. After reading
an instruction into IB, a processor interprets this code as follows (line numbers
refer to Figure 5.1):

Read  cycle  (line  09):  read  into  RR  the  contents  of  shared  memory  location  R(r), 

Compute  cycle  (line  10):  assign  to  registers  in  r  the  result  of  computation  C(r), 

Write  cycle  (line  11):  write  the  contents  of  WR  to  shared  memory  location  W(r), 

Update  instruction  counter  (line  12):  compute  next  instruction  address;  IC  =  0  is 
a  halt. 


□ 


Figure 5.1: Universal PRAM Interpreter.

Also note that we allow for processors used by a PRAM program to read and update
the entire private memory of size l in unit time. When simulating PRAMs on fail-stop
PRAMs, the fail-stop processors will not need and do not use this capability. This
allows for the PRAM programs to be more flexible, without imposing any restrictions
on the fail-stop processors used to simulate these programs.

In  Definition  5.1,  R(),  W()  and  J()  are  expressions  involving  registers,  and  C()  is 
the  code  for  individual  processors’  compute  cycles.  PRAM  programs  are  executed  by 
the  UPI  that  accepts  a  PRAM  program  and  its  input  as  data. 
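To make the interpreter concrete, here is a small failure-free rendering of the UPI loop in Python. It is a sketch under assumptions of ours: instructions are tuples of four Python functions standing for R(), C(), W() and J(), registers are dictionaries that also carry the PID, shared memory addresses are 0-based, and concurrent writes are resolved by letting the highest-numbered writer win (one admissible ARBITRARY outcome).

```python
def run_pram(prog, sm, max_steps=1000):
    """Interpret a failure-free PRAM program.

    prog[i] is the instruction list of processor i + 1; each instruction is a
    tuple (R, C, W, J) of functions of the register record. R and W return
    0-based shared-memory addresses, C returns the updated record, and J
    returns the next instruction counter (0 means halt).
    """
    regs = [{"PID": i + 1, "IC": 1, "RR": 0, "WR": 0} for i in range(len(prog))]
    for _ in range(max_steps):
        active = [i for i, r in enumerate(regs) if r["IC"] != 0]
        if not active:
            break
        instr = {i: prog[i][regs[i]["IC"] - 1] for i in active}  # fetch
        for i in active:  # read cycle
            regs[i]["RR"] = sm[instr[i][0](regs[i])]
        for i in active:  # compute cycle
            regs[i] = instr[i][1](regs[i])
        for i in active:  # write cycle (later writers overwrite earlier ones)
            sm[instr[i][2](regs[i])] = regs[i]["WR"]
        for i in active:  # update instruction counter
            regs[i]["IC"] = instr[i][3](regs[i])
    return sm


# One-instruction program: each processor doubles its own cell, then halts.
double = (
    lambda r: r["PID"] - 1,              # R: address to read
    lambda r: {**r, "WR": 2 * r["RR"]},  # C: compute the value to write
    lambda r: r["PID"] - 1,              # W: address to write
    lambda r: 0,                         # J: halt
)
result = run_pram([[double]] * 3, [1, 2, 3])
```

Each cycle is completed for all active processors before the next one starts, which is how the sketch models the synchrony of the three PRAM cycles.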

5.3  General  Simulations  on  Fail-stop  Processors  Without 
Restart 

We say that a PRAM algorithm is fault-tolerant if it completes its task in the
presence of arbitrary fail-stop errors. We say that a PRAM algorithm is simulated by a
fault-tolerant PRAM algorithm when both algorithms exhibit identical input/output
behavior. Such simulation is robust if it is efficient, and it is optimal if the efficiency of
the simulating algorithm (measured as work) is within a constant factor of the efficiency
of the simulated algorithm:

Definition 5.2 Let M be a simulation of a P-processor PRAM algorithm Q, whose
parallel-time on inputs of size N is less than or equal to t(N), by a fault-tolerant
P'-processor algorithm Q'.

(1) M is robust, if Q' has S = O(P · t(N) · log^c N) for a constant c, and

(2) M is optimal, if Q' has S = O(P · t(N)). □

Remark 5.1 Note that a robust simulation of an algorithm is not the same as a
robust algorithm. This is because robustness is defined in relation to the best sequential
algorithm, and not in relation to the work of an arbitrary parallel algorithm which may
be inefficient. However, a robust simulation of an efficient algorithm yields a robust
algorithm.

Finally, it is relatively easy to construct robust simulations using a small number of
processors, e.g., using P processors to simulate N processors when P is polylogarithmic
in N. We show next that this is possible for P in 1 ≤ P ≤ N, and optimally so for P
in 1 ≤ P ≤ N/log² N.

The UPI in Figure 5.1 models arbitrary computation performed by a PRAM, and
in this section we develop an efficient fault-tolerant UPI.

Let us define S_w(X, P) to be the cost of solving the Write-All problem of size X,
using P (P ≤ X) processors. We measure this cost as S, and we will make use of
Property 2.5, which states that S_w(X, P') ≤ S_w(X, P) when P' ≤ P.

Now  the  main  lemma. 

Lemma 5.1 Let Q be a P-processor COMMON (ARBITRARY) CRCW PRAM algorithm
that uses m shared and l = O(P log^k N) (constant k) local memory on inputs of size N.
Q can be simulated on a P'-processor (P' ≤ P) fail-stop COMMON (ARBITRARY) CRCW
PRAM using O(m + l) shared memory, at the cost of S = O(S_w(P,P') · log^k N) per
parallel PRAM step.

Proof: The desired result is achieved by constructing a robust version of the UPI of
Figure 5.1. Figure 5.2 illustrates an intermediate step of the construction, and Figure 5.3
is the final pseudo-code for the robust UPI. The proof consists of the following six
steps: (1) local memory is stored in shared memory, (2) shared memory is split into
two generations, “current” and “future”, (3) PRAM step computations are changed to
use current memory as input and use future as output and as a computation
scratchpad, (4) current and future memories are reconciled to produce new current memory,
(5) instruction counters are examined to detect termination, and finally, (6) each of the
modified groups of actions is placed in the work phase of a Write-All algorithm. Now
the details.

First, we have to store processor local memory in shared memory, since after
processors fail, the local memory is lost. We observe that l local memory cells can be
stored in shared memory without affecting program semantics. To do this, registers are
subscripted by PID (lines 04-05, all lines refer to Figure 5.2).

We then separate all shared memory and registers into two generations: current
(using subscript 0) and future (subscript 1) (lines 02 and 05). All memory references
are now made using the generation subscript 0 or 1 (lines 10-19). This does not affect
the asymptotic memory requirement of size O(m + l). This is done to assure that the
memory that can be accessed by processors that have not yet completed a particular
action (due to failures) is not changed. This allows the PRAM instructions to be
restarted when active processors are re-allocated.

01 forall processors PID=1..P parbegin
02   shared SM[0..1,1..m];                 -- two generations of shared memory
03   shared PROG[1..P,1..size];            -- P programs of length size; one generation
04   shared IB[1..P];                      -- instruction buffers, shared
05   shared r[0..1,1..P] record IC, RR, WR, ... end  -- two sets of shared registers
06   -- Initialize instruction counters
07   r[0,PID].IC := 1;                     -- start at the first instruction
08   while r[0,PID].IC ≠ 0 do              -- while not HALT
09     -- Tentative computation
10     IB[PID] := PROG[PID,r[0,PID].IC];   -- fetch <R() C() W() J()>
11     r[1,PID] := r[0,PID];               -- copy registers to scratchpad
12     r[1,PID].RR := SM[0,R(r[1,PID])];   -- read cycle
13     r[1,PID] := C(r[1,PID]);            -- compute cycle
14     SM[1,W(r[1,PID])] := r[1,PID].WR;   -- write cycle
15     r[1,PID].IC := J(r[1,PID]);         -- next instruction
16     -- Reconcile shared memory
17     SM[0,W(r[1,PID])] := SM[1,W(r[1,PID])];
18     -- Reconcile registers
19     r[0,PID] := r[1,PID]
20   od
21 parend

Figure 5.2: Modified UPI using two generations of shared memory.

Next  we  group  statements  into  four  actions  to  compute  future  memory  values  and  to 
reconcile  current  and  future  memories.  The  initialization  of  ICs  takes  place  in  line  07. 
The  contents  of  the  while  loop  of  Figure  5.1  are  grouped  into  3  actions:  the  action 
on lines 09-15 performs the tentative PRAM step computation using current memory
as  input,  and  future  memory  as  output  and  as  a  scratchpad;  the  action  on  lines  16-17 
reconciles  shared  memory;  and  the  action  on  lines  18-19  reconciles  processor  registers. 

Now  a  Write-All  algorithm  is  used  with  each  of  the  four  actions  as  work  phases, 
see  Figure  5.3,  steps  a,  b,  c  and  d.  This  assures  that  all  actions  of  a  given  phase  are 
performed  before  any  of  the  actions  of  the  next  phase  are  attempted. 
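The effect of separating the step into phases can be seen in a small Python sketch. It is an illustration under assumptions of ours: the function name `robust_step`, the dictionary registers and the `attempts` list are not from the thesis, and the Write-All algorithm that would schedule the simulating processors is replaced by a fixed list in which repeated PIDs stand for actions re-executed after failures. Each loop below plays the role of one work phase, so a phase finishes before the next one begins.

```python
def robust_step(sm, r, prog, attempts):
    """Simulate one PRAM step in three phases; `attempts` may repeat PIDs.

    sm[0], sm[1] -- current and future shared memory
    r[0], r[1]   -- current and future register records, indexed by PID - 1
    prog[i]      -- list of (R, C, W, J) tuples for simulated processor i + 1
    """
    live = [i for i in attempts if r[0][i]["IC"] != 0]
    for i in live:  # phase b: tentative computation, reads generation 0 only
        R, C, W, J = prog[i][r[0][i]["IC"] - 1]
        t = dict(r[0][i])
        t["RR"] = sm[0][R(t)]
        t = C(t)
        sm[1][W(t)] = t["WR"]
        t["IC"] = J(t)
        r[1][i] = t
    for i in live:  # phase c: reconcile shared memory
        W = prog[i][r[0][i]["IC"] - 1][2]
        sm[0][W(r[1][i])] = sm[1][W(r[1][i])]
    for i in live:  # phase d: reconcile registers
        r[0][i] = dict(r[1][i])


# Each simulated processor doubles its own cell and halts.
double = (
    lambda t: t["PID"] - 1,
    lambda t: {**t, "WR": 2 * t["RR"]},
    lambda t: t["PID"] - 1,
    lambda t: 0,
)
prog = [[double]] * 3
sm = [[1, 2, 3], [0, 0, 0]]
r = [[{"PID": i + 1, "IC": 1, "RR": 0, "WR": 0} for i in range(3)], [None] * 3]
robust_step(sm, r, prog, [0, 1, 1, 2, 0])
```

Although the steps of simulated processors 1 and 2 are attempted twice, every cell is doubled exactly once, because the tentative phase never modifies the generation it reads from.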

Algorithm W utilizes workspace memory of size O(P) on inputs of size P using
P' processors (P' ≤ P). This memory initially contains zeroes. Consecutive use of
the algorithm W in a single algorithm requires that this workspace is zeroed. This is
accomplished by utilizing three interchangeable workspaces, call them H1, H2 and H3.
H1 is used in the actual algorithm. H2 is used in zeroing H1 and H3 using a
Write-All algorithm. H3 is saved for the next use of the zeroing cleanup step. H2 and H3
then alternate. This simple, but modular cleanup technique affects neither the overall
asymptotic memory usage, nor the asymptotic efficiency of robust algorithms when the
cleanup stages are interleaved with the computation stages (Figure 5.3).

forall processors PID=1..P' parbegin
    shared PROG[...], SM[...], r[...], H1, H2, H3, HALT initial false;
    a: Initialize ICs to 1 using H1; Clear H1 using H3;
    while not HALT do                  -- while not all processors halted
        b: Perform a tentative PRAM step using H1;
           Clear H1 and H3 using H2;
        c: Reconcile shared memory using H1;
           Clear H1 and H2 using H3;
        d: Reconcile registers using H1;
           Clear H1 and H3 using H2;
        e: Compute, using H1, HALT=false iff ∃PID such that IC ≠ 0;
           Clear H1 and H2 using H3;
    od
parend

Figure 5.3: Pseudo-code for a robust UPI.
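The rotation of the three workspaces can be traced with a short Python sketch. This is only a model of the bookkeeping, with assumptions of ours: a workspace is a list, "using" a workspace is modeled by filling it with ones, the function name `rotate_workspaces` is hypothetical, and the rotation starts with H2 as in the prose description above.

```python
def rotate_workspaces(phases: int, size: int = 4) -> list[str]:
    """Return which helper workspace performs the cleanup in each phase."""
    ws = {"H1": [0] * size, "H2": [0] * size, "H3": [0] * size}
    helper, spare = "H2", "H3"
    used = []
    for _ in range(phases):
        # Both workspaces about to be used must contain only zeroes.
        assert not any(ws["H1"]) and not any(ws[helper])
        ws["H1"] = [1] * size     # main Write-All phase dirties H1
        ws[helper] = [1] * size   # the zeroing Write-All dirties its own workspace
        ws["H1"] = [0] * size     # ... while clearing H1
        ws[spare] = [0] * size    # ... and the other helper
        used.append(helper)
        helper, spare = spare, helper  # the two helpers alternate
    return used


order = rotate_workspaces(4)
```

The internal assertion checks the invariant that makes the scheme work: at the start of every phase, H1 and the helper about to be used are already clean.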

Lastly, we need to check for the algorithm termination by verifying that all processors
halted. This is done by computing shared HALT to be true, iff for all PIDs, IC=0. This
can be accomplished in unit time on a CRCW PRAM in the absence of failures. Here we
compute HALT as follows: it is initialized to true, then using the Write-All technique,
processors examine the simulated ICs and execute “HALT := false” for each IC ≠ 0
(Figure 5.3, step e).
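The termination test amounts to the following, shown here as a hedged Python sketch. The name `compute_halt` and the `attempts` list (the order in which simulated instruction counters happen to be examined, possibly with repeats) are assumptions of ours.

```python
def compute_halt(ics: list[int], attempts: list[int]) -> bool:
    """Return True iff every simulated instruction counter is 0.

    `attempts` lists the 0-based simulated PIDs in the order they are
    examined; it must cover every PID and may contain repeats.
    """
    halt = True                 # initialized to true before the phase
    for pid in attempts:
        if ics[pid] != 0:
            halt = False        # every writer stores the same value
    return halt


all_done = compute_halt([0, 0, 0], [0, 1, 2, 1])
still_running = compute_halt([0, 3, 0], [2, 1, 0, 1])
```

Since all writers store the same value, the concurrent writes to HALT are legal even on a COMMON CRCW PRAM, and repeated examinations of the same counter do no harm.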

In this simulation, due to failures, the synchronous tentative PRAM step
computations may occur asynchronously (in the sense that the synchronous PRAM steps of the
simulated processors can be performed at different times by the simulating processors).
However, since no processor reads or writes registers of other processors, and since the
program PROG[...] is either COMMON or ARBITRARY CRCW, it does not matter in
what order the shared memory is written by the simulated processors. Therefore the
simulated PRAM steps will write values to shared memory in a way that is consistent
with both the COMMON and ARBITRARY models.

To analyze the complexity of the resulting fault-tolerant computation, we first
examine the use of registers. In tentative computation and reconciliation of registers in
Figure 5.2, each PRAM processor may need to compute the values of, and copy the
shared memory that represents a processor’s registers as the result of interpreting the
computation C(). This must be done sequentially by each processor, thus incurring a
polylogarithmic in N multiplicative overhead. However, this overhead still falls within
the definition of robustness (Definition 2.6).

The complexity of applying this technique to a single PRAM step is bounded by the
complexity of solving the Write-All problem in steps a, c and e of Figure 5.3, plus the
complexity of robustly writing and copying l = O(P log^k N) shared memory in steps b
and d. To copy l memory, we apply the Write-All technique l/P times at the cost of
S_w(P,P') per application. Therefore, the total cost per single PRAM step is

\[
S = O\big[S_w(P,P') + (l/P)\,S_w(P,P')\big]
  = O\big[S_w(P,P') + \log^k N \cdot S_w(P,P')\big]
  = O\big(S_w(P,P') \cdot \log^k N\big).
\]

□


A construction for PRAM simulation similar to the one used in Lemma 5.1 was
independently developed by Kedem et al. in [59], using the Write-All technique to
simulate any COMMON or ARBITRARY CRCW PRAM algorithm that uses arbitrary shared
and no local memory. That construction does not allow any local memory, while ours
allows up to polylogarithmic local memory. Note that we allow the processors to update
the local memory in unit time. This makes our simulation nominally more general, because
the use of local memory is not mandated but is allowed up to a certain limit. In [59] it
is also observed that since P processors can
only read P and write P shared memory cells in a single PRAM step, the overhead in
shared memory need only be O(P) when processors have no local memory. The results
that follow do not take advantage of this optimization.


Theorem 5.2 Any P-processor PRAM (EREW, CREW, and WEAK, COMMON,
ARBITRARY, PRIORITY or STRONG CRCW) algorithm that uses arbitrary shared and
polylogarithmic in the input size local memory can be robustly simulated on a fail-stop
P'-processor COMMON or ARBITRARY CRCW PRAM, when P' ≤ P.

Proof: We use any robust Write-All solution, e.g., algorithm W. For COMMON or
ARBITRARY CRCW PRAMs, Theorem 3.5 establishes S_w(P,P') = O(P log² P). Then
the proof follows from Lemma 5.1 and Definition 5.2(1).

Correct EREW (exclusive read, exclusive write), CREW and WEAK CRCW PRAM
programs  can  be  directly  executed  on  a  CRCW  PRAM  without  changing  the  program 
semantics.  Therefore  the  technique  of  Lemma  5.1  can  also  be  used  with  CREW  and 
EREW  algorithms  that  are  to  be  executed  on  a  CRCW  PRAM.  Thus  the  result  holds 
for  EREW,  CREW,  and  WEAK  CRCW  models. 

To extend the simulation to stronger CRCW models such as PRIORITY and STRONG,
we use an efficient algorithm transformation technique that preserves algorithms’
efficiency to within a factor logarithmic in the number of processors, as shown in [40]
(attributed to folklore): “A parallel computation that can be performed in time r on
a P-processor STRONG CRCW PRAM, can also be performed in time r log P using P
EREW processors”. Therefore we first preprocess PRIORITY and STRONG PRAM
algorithms, and then robustly simulate the transformed algorithms on fail-stop COMMON or
ARBITRARY CRCW PRAMs. □

To  achieve  optimal  simulation,  we  need  to  use  an  optimal  Write-All  solution.  For  the 
purposes  of  the  proof  we  are  going  to  use  the  algorithm  W  within  its  range  of  optimality: 

Theorem 5.3 Any P-processor PRAM algorithm that uses arbitrary shared memory and
constant local memory per processor can be optimally simulated on a fail-stop P'-processor
CRCW PRAM, when $P' \le P/\log^2 P$. EREW, CREW, and WEAK and COMMON CRCW
PRAM algorithms are simulated on fail-stop COMMON CRCW PRAMs; ARBITRARY,
PRIORITY and STRONG CRCW PRAMs are simulated on fail-stop CRCW PRAMs of
the same type.

Proof: For the optimality result we use Theorem 3.7, which establishes
$S_w(P, P/\log^2 P) = O(P)$. We also eliminate the overhead of copying local memories by
allowing constant local memory per processor. In this case the cost of copying local memory
is absorbed by the cost of optimal simulation of PRAM instructions by using Lemma 3.1
with k = 0 (constant local memory). This proves the result for EREW, CREW, and WEAK,
COMMON and ARBITRARY CRCW PRAMs.




To prove the result for the PRIORITY PRAM, we need to show that (1) when two or more
PRIORITY processors write concurrently, then the processor that simulates the highest
numbered processor will succeed, and (2) lower numbered processors do not overwrite
cells written by higher numbered processors during several phases of the single PRAM
step simulation.

To show (1) it is sufficient to demonstrate that the simulation has the processor
allocation monotonicity property. This property is defined as follows: if PRAM steps
of processors with PIDs $p_1$ and $p_2$ (without loss of generality let $p_1 < p_2$) are simulated
respectively by processors with PIDs $p_1'$ and $p_2'$, then $p_1' < p_2'$. This is assured when
using algorithm W (or algorithm V) as the Write-All solution, since the algorithm has
this property, as we have shown in Section 3.2.3.

Property  (2)  can  be  assured  using  auxiliary  storage  as  in  the  transformation  in 
Eppstein  and  Galil  [40]:  before  writing,  processors  first  write  the  PID  of  the  simulated 
processor  and  then  the  data,  but  only  if  the  previously  written  PID  is  lower. 

For  the  simpler  case  of  strong  PRAM,  the  concurrent  writes  are  properly  handled 
by  zeroing  the  future  memory  cells  to  which  processors  will  write,  and  then  performing 
writes  only  if  the  value  in  the  cell  is  smaller  than  the  value  to  be  written  (without  loss  of 
generality  use  unsigned  integers).  This  assures  the  correctness  of  asynchronous  writes. 
Synchronous  writes  are  properly  handled  by  the  strong  PRAM  itself  since  regardless 
of  the  processor  used,  the  writes  of  larger  values  will  succeed.  □ 
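
The two write guards used in this proof can be illustrated with a small sequential sketch. It is only a model under stated assumptions: the names `strong_write`, `priority_write` and `replay`, the dictionaries standing in for shared memory, and the `tags` auxiliary cell are illustrative and do not come from the source; each guarded write is treated as a single atomic step, and the writes are replayed in arbitrary order to mimic asynchronous arrival.

```python
import random

def strong_write(memory, addr, value):
    """Strong-model guard: write only if the (pre-zeroed) cell holds less."""
    if memory[addr] < value:
        memory[addr] = value

def priority_write(memory, tags, addr, pid, value):
    """Priority guard: write only if the previously recorded PID is lower."""
    if tags[addr] < pid:
        tags[addr] = pid      # first record the simulated processor's PID
        memory[addr] = value  # then the data

def replay(writes, seed):
    """Apply the same concurrent writes in a shuffled order (assumed model)."""
    order = list(writes)
    random.Random(seed).shuffle(order)
    strong = {0: 0}                    # zeroed target cell
    prio, tags = {0: None}, {0: -1}    # -1: no simulated PID recorded yet
    for pid, value in order:
        strong_write(strong, 0, value)
        priority_write(prio, tags, 0, pid, value)
    return strong[0], prio[0]

writes = [(1, 40), (2, 7), (3, 19)]    # (simulated PID, value written)
results = {replay(writes, seed) for seed in range(20)}
```

Whatever the arrival order, the strong cell ends up holding the largest value and the priority cell holds the value of the highest numbered simulated processor, which is the behaviour the proof requires.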


5.4 General Simulations on Restartable Fail-Stop Processors


We now extend the results presented in the previous section to the restartable fail-stop
model. We begin by formally stating the main result for a deterministic simulation of
any N-processor synchronous PRAM on P restartable fail-stop processors ($P \le N$),
and then discuss its proof.

Theorem 5.4 Any N-processor PRAM algorithm can be executed on a restartable
fail-stop P-processor CRCW PRAM, with $P \le N$. Each N-processor PRAM step is
executed in the presence of any pattern F of failures and restarts of size M with:

• completed work: $S = O(\min\{N + P \log^2 N + M \log N,\; N \cdot {}\})$,

• overhead ratio: $\sigma = O(\log^2 N)$.

EREW, CREW, and WEAK and COMMON CRCW PRAM algorithms are simulated
on fail-stop COMMON CRCW PRAMs; ARBITRARY and STRONG CRCW PRAMs are
simulated on fail-stop CRCW PRAMs of the same type. □

Remark 5.2 PRIORITY CRCW PRAMs cannot be directly simulated using the same
framework, because one of the algorithms used (namely algorithm X in Section 3.3.1) does
not possess the processor allocation monotonicity property that assures that higher
numbered processors simulate the steps of the higher numbered original processors.

In the previous section we showed in Lemma 5.1 that the complexity of solving
an N-size instance of the Write-All problem using P fail-stop processors is equal to the
complexity of executing a single N-processor PRAM step on a fail-stop P-processor
PRAM. That result also holds in the restartable fail-stop model, since the proof of
Lemma 5.1 does not rely on the absence of restarts.

Here  we  describe  how  algorithms  V  and  X'  are  combined  with  the  framework  we 
have  established  to  yield  efficient  executions  of  PRAM  programs  on  PRAMs  that  are 
subject  to  stop-failures  and  restarts. 

Theorem 5.5 There exists a Write-All solution using $P \le N$ processors on instances of
size N such that for any pattern F of failures and restarts with $|F| \le M$, the completed
work is $S = O(\min\{N + P \log^2 N + M \log N,\; N \cdot {}\})$, and the overhead ratio is
$\sigma = O(\log^2 N)$.

Proof: The executions of algorithms V and X' can be interleaved to yield an algorithm
that achieves the performance as stated. The completed work complexity is asymptotically
equal to the minimum of the completed work performed by V and X'. This is
because the number of cycles performed by each algorithm in the interleaving differs
by at most a multiplicative constant. The overhead ratio is directly inherited from
algorithm V by the same reasoning, because of Definition 2.9 of $\sigma$ and $S$. □
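
The interleaving argument can be sketched abstractly. The code below is an illustration under assumptions rather than algorithms V and X' themselves: `interleave` and `fixed_cost` are hypothetical names, each solver is modelled as a generator that yields once per unit of work, and finishing either solver is taken to complete the task.

```python
def interleave(alg_a, alg_b):
    """Alternate single steps of two solvers until one of them finishes.

    Returns (label of the solver that finished, total steps charged to both).
    """
    runners = [("A", alg_a), ("B", alg_b)]
    total = 0
    while True:
        for name, gen in runners:
            try:
                next(gen)
                total += 1
            except StopIteration:
                return name, total

def fixed_cost(steps):
    """Stand-in for a solver that needs exactly `steps` units of work."""
    for _ in range(steps):
        yield

winner, total = interleave(fixed_cost(1000), fixed_cost(30))
```

In this model the combined cost never exceeds twice the cost of the cheaper solver plus one step, which is the "differs by at most a multiplicative constant" observation in the proof.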

Application of the simulation techniques from the previous section in conjunction
with the algorithms V and X' yields efficient and terminating executions of any
non-fault-tolerant PRAM programs in the presence of arbitrary failure and restart patterns.
Theorem 5.4 follows from Theorem 5.5, Lemma 5.1 and Theorem 5.2. The following
corollaries are also interesting:

Corollary 5.6 Under the hypothesis of Theorem 5.4, and if $|F| \le P \le N$, then:

$S = O(N + P \log^2 N)$, and $\sigma = O(\log^2 N)$.

The  fail-stop  (without  restarts)  behavior  of  the  combined  algorithm  is  subsumed  by 
this  corollary.  The  next  result  gives  additional  insight  into  the  efficiency  of  our  solution: 

Corollary 5.7 Under the hypothesis of Theorem 5.4:

• when $|F|$ is $\Omega(N \log N)$, then $\sigma$ is $O(\log N)$,

• when $|F|$ is   then $\sigma$ is $O(1)$.

Thus the overhead ratio $\sigma$ of our algorithm actually improves for large failure
patterns. These results also suggest that it is harder to deal efficiently with a few
worst-case failures than with a large number of failures.

Our next corollary demonstrates a non-trivial range of parameters for which the
completed work is optimal, i.e., the work performed in executing a parallel algorithm
on a faulty PRAM is asymptotically equal to the Parallel-time × Processors product for
that algorithm.

Corollary 5.8 Any N-processor, $\tau$-time PRAM algorithm can be executed on a
$P \le N/\log^2 N$ processor fail-stop CRCW PRAM, such that when during the execution of
each N-processor step of that algorithm the total number of processor failures and
restarts is $O(N/\log N)$, then the completed work is $S = O(\tau \cdot N)$.
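
As a check on Corollary 5.8, the per-step bound of Theorem 5.4 can be evaluated under the corollary's hypotheses. The derivation below is a sketch that uses only the first argument of the minimum, $N + P \log^2 N + M \log N$, with $M$ the number of failures and restarts during one simulated step.

```latex
\begin{align*}
P \le \frac{N}{\log^2 N} &\;\Rightarrow\; P \log^2 N \le N, \\
M = O\!\left(\frac{N}{\log N}\right) &\;\Rightarrow\; M \log N = O(N), \\
\text{work per simulated step} &= O(N + P \log^2 N + M \log N) = O(N), \\
S &= O(\tau \cdot N) \quad \text{summed over the } \tau \text{ steps.}
\end{align*}
```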

Of course it is also true that optimality is preserved in the absence of failures or
when during the execution of each N-processor step there are $O(\log N)$ failures and
restarts per simulating processor.

5.5 Improving Oblivious Simulations

In addition to serving as the basis for oblivious simulations, any solution for the
Write-All problem can also be readily used as a building block for custom transformations of
efficient parallel algorithms into robust ones [55]. Custom transformations are interesting
because in some cases it is possible to improve on the work of the naive oblivious
simulation. These improvements are most significant for fast algorithms when a full
range of processors is used, i.e., when N processors are used to simulate N processors,
because in this case the parallel slack cannot be taken advantage of. For example, in the
models with clear initial memory, a factor of log log log N was saved off the pointer doubling
simulations [55], and using randomization and off-line adversaries, improvements can
be obtained in the expected work of other algorithms [72, 75].

Using the general simulation techniques, such as [59, 75, 92], if $S_w(N, P)$ is the
efficiency of solving a Write-All instance of size N using P processors, then a single
N-processor PRAM step can be simulated using P fail-stop processors and work $S_w(N, P)$.
Thus if the Parallel-time × Processors of an original N-processor algorithm is $\tau \cdot N$,
then the work S of the fault-tolerant version of the algorithm will be no better than
$O(\tau \cdot S_w(N, P))$.

One immediate result that improves on the general simulations follows from the fact
that algorithms V, W and X, by their definition, implement an associative operation
on N values.

Proposition 5.2 Given any associative operation $\oplus$ on integers, and an integer array
x[1..N], it is possible to robustly compute $\bigoplus_{i=1}^{N} x[i]$ using P fail-stop processors at a
cost of a single application of any of the algorithms V, W or X.

This saves a full log N factor for all simulations. The savings are also possible for
the important prefix sums and pointer doubling algorithms.

5.5.1  Parallel  prefix 

We now show how to obtain deterministic improvements in work for the prefix sums
algorithm that occurs in solutions of several important problems [21]. Efficient parallel
algorithms and circuits for computing prefix sums were given by Ladner and Fischer in
[64], where the prefix problem is defined as follows: Given an associative operation $\oplus$ on
a domain $D$, and $x_1, \ldots, x_n \in D$, compute, for each $k$ ($1 \le k \le n$), the sum $\bigoplus_{i=1}^{k} x_i$.

In order to compute the prefix sums of N values using N processors, at least
$\log N / \log\log N$ parallel steps are required [20, 67], and the known algorithms require
at least $\log N$ steps. Therefore an oblivious simulation of a known prefix algorithm
will require simulating at least $\log N$ steps. When using P = N processors, the work
of such a simulation will be $O(S_w \cdot \log N)$. Here we extend Proposition 5.2 and show a
robust prefix sum algorithm whose work complexity is $O(S_w)$, thus improving oblivious
deterministic simulation by a factor of $\log N$.

In the no-restart fail-stop model we have the following result:

Theorem 5.9 Parallel prefix for N values can be computed using N non-restartable
fail-stop processors using $O(N)$ clear memory with $S = O(N \log^2 N / \log\log N)$.

Proof: The prefix summation algorithm that we use as the basis is an
iterative version of the recursive algorithm of [64]. The algorithm consists of two stages:
(1) first a binary summation tree is computed, (2) then the individual prefix sums are
computed from the summation tree obtained in the first stage. Each prefix sum requires
no more than a logarithmic number of additions.

Each stage can be performed in logarithmic time in parallel by up to N processors.
To produce the robust version of the above algorithm, we use algorithm W twice to
implement these two stages. For each stage the controls of algorithm W are used with
appropriate modifications as follows:

1. A binary summation tree is computed in bottom up traversals at the same time
when the progress tree of algorithm W is being updated. This modification to the
algorithm does not affect its asymptotic complexity.

2. This stage uses the work phase of algorithm W modified to include the logarithmic
time summation operations using the tree computed in stage 1.

In the code, shown in Figure 5.4, ((i)) is a binary string representing the value i
in binary, where the most significant bit is bit number 0, and ((i))_h is the true/false
value of the h-th most significant bit of the binary string representing i. The loop
in lines 09-18 is the top-down traversal of the summation tree. In lines 13-17 the
appropriate subtree sum is added (line 14) at depth h only if the corresponding
bit value of the processor PID is true.

01 forall processors PID = 0..N parbegin
02   shared integer array sum[1..2N-1];   -- summation tree
03   shared integer array prefix[1..N];   -- prefix sums
04   private integer j, j1, j2,           -- current/left/right indices
05                   h;                   -- depth in the summation tree
06   j := 1;                              -- begin at the root,
07   h := 0;                              -- and at depth 0
08   prefix[PID] := 0;                    -- initialize the sum
09   while h ≠ log N do                   -- traverse from root to leaf
10     h := h + 1
11     j1 := 2 * j                        -- left index
12     j2 := j1 + 1                       -- right index
13     if ((PID))_h                       -- is the sub-sum at this level included?
14     then prefix[PID] := prefix[PID] + sum[j1]   -- add the left sub-sum
15          j := j2                       -- go down to the right
16     else j := j1                       -- go down to the left
17     fi;
18   od
19 parend

Figure 5.4: Second stage of robust prefix computation.


□ 
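
The two stages can be tried out in a failure-free setting with the sketch below. It is not the robust algorithm: the controls of algorithm W are omitted, the processors are simulated by a sequential loop, and the function names are illustrative. It also assumes that N is a power of two, that PIDs run from 0 to N-1, and that the leaf's own value is added at the end so that the result is the inclusive prefix sum.

```python
from itertools import accumulate

def build_sum_tree(x):
    """Stage 1: heap-ordered summation tree; root at 1, leaves at N..2N-1."""
    n = len(x)                        # assumed to be a power of two
    tree = [0] * (2 * n)
    tree[n:] = x
    for j in range(n - 1, 0, -1):     # bottom-up pass
        tree[j] = tree[2 * j] + tree[2 * j + 1]
    return tree

def prefix_for(pid, tree, n):
    """Stage 2 for one processor: root-to-leaf walk guided by the bits of pid."""
    depth = n.bit_length() - 1        # log2(n)
    j, total = 1, 0
    for h in range(depth):
        left, right = 2 * j, 2 * j + 1
        if (pid >> (depth - 1 - h)) & 1:   # h-th most significant bit is set
            total += tree[left]            # the whole left subtree precedes pid
            j = right
        else:
            j = left
    return total + tree[j]            # add the leaf itself (inclusive prefix)

x = [3, 1, 4, 1, 5, 9, 2, 6]
tree = build_sum_tree(x)
prefix = [prefix_for(pid, tree, len(x)) for pid in range(len(x))]
```

Each processor performs one addition per tree level, which matches the logarithmic number of additions per prefix sum noted in the proof.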


Note that because of the lower bounds of Beame and Hastad [20] and Li and Yesha
[67], at least $\log N / \log\log N$ parallel time and at least $N \log N / \log\log N$ work will be
required by P = N processors to compute the prefix sums in the absence of failures.
Therefore the multiplicative overhead in work of our parallel prefix algorithm is only
$\log N$ when using algorithm W in the fail-stop model.


5.5.2  Pointer  doubling 

Another important improvement for the fail-stop case is a robust pointer doubling
operation that is a basic building block for many parallel algorithms. This is accomplished
using algorithm W, the most efficient algorithm to date for the fail-stop model.

01 forall processors PID=1..N parbegin
02   for 1..log(N)/loglog(N) do
03     -- Perform a single stage to double each pointer at least log(N) times
04     Phase W3: Each processor doubles its leaf's pointer log(N) times.
05     Phase W4: Bottom up traversal to (under)estimate the no. of leaves visited
06     while the underestimate of the visited leaves is not N do
07       Phase W1: Perform bottom up traversal to enumerate remaining processors
08       Phase W2: Perform top down traversal to reschedule work
09       Phase W3: Each processor doubles its leaf's pointers log(N) times.
10       Phase W4: Perform bottom up traversal to measure progress made
11     od
12   od
13 parend

Figure 5.5: A high level view of the robust pointer doubling algorithm


Proposition 5.3 There is a robust list ranking algorithm for the fail-stop model with
$S = O(\log N \cdot S_w(N, P) / \log\log N)$, where N is the input list size and $S_w(N, P)$ is the
complexity of algorithm W for the initial number of processors $P$: $1 \le P \le N$.


Proof: The robust algorithm is implemented using a variation of algorithm W, general
parallel assignment, and the standard pointer doubling algorithm. A high level
algorithm description is given in Figure 5.5, where Phases W1 through W4 refer to the
phases of algorithm W.

We associate each list element with a progress tree leaf. Phase W3 uses the general
parallel assignment approach to double pointers and update distances. As before, we
use two generations of the arrays representing the pointers and the running ranks of the
list elements: one is used as the “current” values, and the other as the “next” values
being computed. Binary tags are used to determine which generation is “current”, and
which is “next”. The generations alternate as the computation progresses.

The log N pointer doubling operations in Phase W3 make it sufficient for the outermost
for loop to iterate $\log N / \log\log N$ times, and they do not affect the complexity of
the approach in algorithm W, since Phases W1, W2 and W4 take $O(\log N)$ time. This
results in $S = O\left(\frac{\log N}{\log\log N} \cdot S_w(N, P)\right)$. □
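
The two-generation scheme for pointer doubling can be sketched in a failure-free, sequential form. The code below is an illustration under assumptions: it omits the progress tree and all phases of algorithm W, simulates each parallel step with a loop, and uses a hypothetical function name `list_rank` and an array layout chosen for this example (the tail of the list points to itself).

```python
def list_rank(succ):
    """Rank list elements by pointer doubling.

    succ[i] is the successor of element i; the tail points to itself.
    Returns rank[i] = number of links from element i to the tail.
    """
    n = len(succ)
    # Two generations of the pointer and rank arrays; `cur` tags the current one.
    ptr = [list(succ), [0] * n]
    rank = [[0 if succ[i] == i else 1 for i in range(n)], [0] * n]
    cur = 0
    rounds = max(1, (n - 1).bit_length())    # about log2(n) doublings suffice
    for _ in range(rounds):
        nxt = 1 - cur
        for i in range(n):                   # one simulated parallel step
            p = ptr[cur][i]
            rank[nxt][i] = rank[cur][i] + rank[cur][p]
            ptr[nxt][i] = ptr[cur][p]
        cur = nxt                            # the generations alternate
    return rank[cur]

ranks = list_rank([1, 2, 3, 4, 4])
```

Because a step reads only the current generation and writes only the next one, repeating the update of an element within a step produces the same values, which is what makes the step safe to re-execute when work is rescheduled.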

The technique of Proposition 5.3 for the pointer doubling algorithm achieves a
log log N improvement in work over the naive simulations, i.e., instead of
$S = O(\log N \cdot S_w)$ we achieve $S = O(\log N \cdot S_w / \log\log N)$. This improvement can be
used with several important robust algorithms that are based on pointer doubling:

Proposition 5.4 There is a robust parallel algorithm for computing the tree functions
of Tarjan and Vishkin [96] with $S = O(\log N \cdot S_w(N, P) / \log\log N)$, where N is the
input tree size and $S_w(N, P)$ is the complexity of algorithm W for the initial number
of processors $P$: $1 \le P \le N$.

The robust algorithms obtained using our technique are optimized for the worst
case behavior in the presence of arbitrary fail-stop error patterns. These algorithms
incur an $O(\log^2 N)$ multiplicative overhead relative to the source algorithm when using N
processors, and this overhead is reduced to $O(\log N)$ in the absence of failures. However,
the robust list ranking algorithm can be tailored to yield $S = O(N \log N)$ in the absence
of failures. This is accomplished by preceding the algorithm with log N pointer doubling
operations and a Phase W4 bottom-up traversal. This results in $O(N \log N)$ additive
overhead. Therefore, for the algorithms that are dominated by pointer doubling with
a cost of $O(N \log N)$, e.g., Tarjan and Vishkin [96], there is no asymptotic degradation
in the absence of failures. This optimization can be used with, for example, sorting
techniques such as Batcher [18] to reduce the overall multiplicative cost to $O(\log N)$ in
the absence of failures over the $O(N \log^2 N)$ cost associated with sorting networks.

Finally, when a Write-All solution is used within its range of optimality (by taking
advantage of parallel slackness) as the basis for fault-tolerant algorithms, then we
obtain fault-tolerant solutions to all of the above problems, such that the available
processor steps S is asymptotically equal to the Parallel-time × Processors of the original
algorithms.


5.6  On  Parallel  Complexity  Classes  and  Fault-Tolerance 

We briefly mentioned in the discussion of efficiency measures in Chapter 2 that
it is important for parallel algorithms to have efficient work both in the failure-free
environment and when they are subject to failures. We also remarked in this
chapter that efficient simulations of parallel algorithms result in efficient algorithms
only if the simulated algorithm is efficient to begin with. Here we address these topics
further.

Many efficient parallel algorithms belong to the class NC; however, the converse is not
necessarily true. This is because the algorithms in NC allow for polynomial inefficiency
in work. In NC the efficiency is characterized in terms of (polylogarithmic) time, but the
computational agent can be large (polynomial) relative to the size of a problem [30, 81].
Further critique of the notion that the NC class of algorithms is the class of efficient
parallel algorithms is given by Kruskal et al. in [63].

In the context of fault-tolerant computation we suggest that while a definition of
robustness can be made in terms of NC, this definition is not very meaningful in terms
of the resulting algorithm efficiency. An NC algorithm can be made fault-tolerant by
clustering a polynomial number of processors and assigning them to the work that is
normally performed by a single processor. The resulting algorithm is correct for as long
as at least one processor remains active in each cluster. But clearly such an algorithm
is extremely inefficient, and such a technique is extremely wasteful, even if the resulting
algorithm still meets the NC criteria of efficiency, i.e., polylogarithmic time and
polynomial resources.

To reiterate: in order to better characterize the efficiency of parallel algorithms, the
efficiency measures need to take into account both the parallel time and the size of the
computational resource, i.e., parallel work. Such characterizations of parallel algorithm
efficiency are defined by Vitter and Simons in [100] and expanded on by Kruskal et al. in
[63]. The efficiency classes defined in [63] are as follows:

Let A be a problem such that the (RAM) time complexity of the best known
sequential algorithm is $T(N)$. A parallel algorithm that solves an N-size instance of A
using $P(N)$ processors in $\tau(N)$ parallel time belongs to the class:

1. ENC if $\tau(N) = \log^{O(1)}(T(N))$ and $\tau(N) \cdot P(N) = O(T(N))$.

2. EP if $\tau(N) \le T(N)^{\varepsilon}$ (constant $\varepsilon < 1$) and $\tau(N) \cdot P(N) = O(T(N))$.

3. ANC if $\tau(N) = \log^{O(1)}(T(N))$ and $\tau(N) \cdot P(N) = T(N) \cdot \log^{O(1)}(T(N))$.

4. AP if $\tau(N) \le T(N)^{\varepsilon}$ (constant $\varepsilon < 1$) and $\tau(N) \cdot P(N) = T(N) \cdot \log^{O(1)}(T(N))$.

5. SNC if $\tau(N) = \log^{O(1)}(T(N))$ and $\tau(N) \cdot P(N) = T(N)^{O(1)}$.

6. SP if $\tau(N) \le T(N)^{\varepsilon}$ (constant $\varepsilon < 1$) and $\tau(N) \cdot P(N) = T(N)^{O(1)}$.

Analogously with our definition of robustness, the complexity characterizing these
classes is defined with respect to the time complexity of the best sequential algorithm.
There are two complexity criteria for each class: (a) parallel time $\tau(N)$, and (b) parallel
work $\tau(N) \cdot P(N)$.

In the next two subsections we define criteria via which we can evaluate whether our
algorithm transformations preserve the efficiency of the algorithms in each of the classes
above.

In order to be able to use the time complexity of the original algorithm as a comparison
metric, we need to introduce some measure of time for the fault-tolerant algorithms.
To do that, we use the maximum time that is required by the fault-tolerant
algorithm to complete its computation provided that a linear number of processors are
still active. For the fail-stop model this corresponds to an execution in which at least
cP processors survive for some constant c > 0, while for the restartable model we ask
that each update cycle is completed by at least cP processors.

If  we  do  not  make  this  assumption  then  the  best  we  can  conclude  about  the  running 
time  of  the  fault-tolerant  versions  is  that  it  is  at  least  the  time  of  the  best  sequential 
algorithm,  because  time  can  be  severely  degraded  when  the  remaining  number  of  pro¬ 
cessors  becomes  small.  For  example,  the  algorithms  become  sequential  when  only  one 
processor  is  active. 

5.6.1  Fail-stop  model  without  restarts 

We first examine whether the classes of [63] are closed with respect to our fault-tolerant
transformations in the fail-stop model. We assume that $P = P(N)$ processors are used.
Also, if P is polynomial in N, then $\log P = O(\log N)$.

For any algorithm A, let $\phi(A)$ be the fault-tolerant algorithm that can be constructed
using the techniques in this chapter (as either a simulation or a dedicated algorithm).
We formulate the following definition:

Definition 5.3 Let C be a class in which the parallel time of algorithms is in the
complexity class $\tau_C$ and the parallel work is in the complexity class $w_C$. We say that
C is closed with respect to a fail-stop without restart fault-tolerant transformation $\phi$ if
for any algorithm A in C:

1. the worst case work S of $\phi(A)$ is such that S is in $w_C$, and

2. the running time $\tau$ is such that $\tau$ is in $\tau_C$ when the minimum number of processors
active during the computation is cP for some constant c > 0. □

Complexity   Time with ≥ cP processors:               Fail-stop work:                                 Closed
class        $c_2\,\tau(N) \log^2 N / \log\log N$       $c_1 \log^{O(1)} N \cdot \tau(N) \cdot P(N)$       under $\phi$
-------------------------------------------------------------------------------------------------------------
ENC          $= O(\log^{O(1)}(T(N)))$                 $> O(T(N))$                                     No
EP           $= O(T(N)^{\varepsilon})$                $> O(T(N))$                                     No
ANC          $= \log^{O(1)}(T(N))$                    $= T(N) \cdot \log^{O(1)}(T(N))$                Yes
AP           $= O(T(N)^{\varepsilon})$                $= T(N) \cdot \log^{O(1)}(T(N))$                Yes
SNC          $= \log^{O(1)}(T(N))$                    $= T(N)^{O(1)}$                                 Yes
SP           $= O(T(N)^{\varepsilon})$                $= T(N)^{O(1)}$                                 Yes

Table 5.1: Closure under the fail-stop transformation $\phi$ (for $P = P(N)$).


An immediate observation is that NC is trivially closed with respect to our
fault-tolerant transformations.

In the fail-stop model, using for example algorithm W as the basis for transforming
non-fault-tolerant algorithms, we have the following:


• the multiplicative overhead in work is $O(\log^2 N / \log\log N)$, and so if the work of
the initial algorithm A is $\tau(N) \cdot P(N)$ then the worst case work of the fault-tolerant
version $\phi(A)$ is $c_1 \log^{O(1)} N \cdot \tau(N) \cdot P(N)$ for some constant $c_1 > 0$,

• algorithm W terminates in $O(S_w / cP) = O(\log^2 N / \log\log N)$ time when at least
cP processors are active; therefore if the parallel time of algorithm A is $\tau(N)$,
then the parallel time of execution for $\phi(A)$ using at least cP active processors is
$c_2\,\tau(N) \log^2 N / \log\log N$ for some constant $c_2 > 0$.
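
To see how these two bounds decide closure, they can be substituted into the class definitions. The following is a sketch for ENC and ANC only; it assumes $\log N = O(\log T(N))$, which is an assumption of this sketch rather than a statement from the text.

```latex
\begin{align*}
\text{ENC:}\quad & \tau(N)\,P(N) = O(T(N))
  \;\Rightarrow\; S = c_1 \log^{O(1)} N \cdot O(T(N)),
  \text{ which is not guaranteed to be } O(T(N)). \\
\text{ANC:}\quad & \tau(N)\,P(N) = T(N) \log^{O(1)} T(N)
  \;\Rightarrow\; S = T(N)\,\log^{O(1)} N\,\log^{O(1)} T(N) = T(N) \log^{O(1)} T(N), \\
& \tau(N) = \log^{O(1)} T(N)
  \;\Rightarrow\; c_2\,\tau(N)\,\frac{\log^2 N}{\log\log N} = \log^{O(1)} T(N).
\end{align*}
```

The polylogarithmic overhead is therefore absorbed by every class whose work bound already tolerates polylogarithmic factors, and it breaks only the classes that demand work within a constant factor of $T(N)$.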


The resulting closure properties of the classes defined in [63] under our fail-stop
transformation are summarized in Table 5.1.

5.6.2 Restartable fail-stop model

We next examine whether the classes of [63] are closed with respect to our transformations
in the restartable fail-stop model. Again we have $\log P = O(\log N)$.

For any algorithm A, let $\rho(A)$ be the restartable fault-tolerant algorithm that can be
constructed using the techniques in this chapter. We formulate the following definition:

Definition 5.4 Let C be a class in which the parallel time of algorithms is in the
complexity class $\tau_C$ and the parallel work is in the complexity class $w_C$. We say that C
is closed with respect to a restartable fail-stop fault-tolerant transformation $\rho$ if for any
algorithm A in C:

1. the worst case overhead $\sigma$ of $\rho(A)$ is such that $\sigma \cdot \tau(N) \cdot P(N)$ is in $w_C$, and

2. the running time $\tau$ is such that $\tau$ is in $\tau_C$ when the number of processors completing
each update cycle of the computation is at least cP for constant c > 0. □

In the fail-stop restartable model we are going to take advantage of the existential
result by Anderson and Woll in [8], who showed that for every $\varepsilon > 0$, there exists
a deterministic algorithm for P processors that simulates P PRAM instructions with
$O(P^{1+\varepsilon})$ work. This result was developed for the asynchronous model, but it also applies
to the fail-stop model with restarts.

In this model we will provide existential closure properties. Assume we have an
algorithm such as the one characterized in the result above. This algorithm can be
interleaved with algorithm V, for example, so that the overhead $\sigma$ of the combined
algorithm is $O(\log^2 N)$. When this combined algorithm is used as the basis for transforming
non-fault-tolerant algorithms, we have the following:

• if the work of the initial algorithm A is $\tau(N) \cdot P(N)$ then
$\sigma \cdot \tau(N) \cdot P(N) = \log^2 N \cdot \tau(N) \cdot P(N)$,

• a single Write-All step terminates in $O(P^{1+\varepsilon} / cP) = O(P^{\varepsilon})$ time when at least
cP processors are active; therefore if the parallel time of algorithm A is $\tau(N)$,
then the parallel time of execution for $\rho(A)$ using at least cP active processors is
$c_2\,\tau(N) \cdot P^{\varepsilon}$ for some constant $c_2 > 0$.

Complexity   Time with ≥ cP processors:        Work:                                      Closed
class        $c_2\,\tau(N) \cdot P^{\varepsilon}$   $\log^2 N \cdot \tau(N) \cdot P(N)$         under $\rho$
-------------------------------------------------------------------------------------------------
ENC          $> O(\log^{O(1)}(T(N)))$          $> O(T(N))$                                No
EP           $= O(T(N)^{\varepsilon})$         $> O(T(N))$                                No
ANC          $> \log^{O(1)}(T(N))$             $= T(N) \cdot \log^{O(1)}(T(N))$           Yes
AP           $= O(T(N)^{\varepsilon})$         $= T(N) \cdot \log^{O(1)}(T(N))$           Yes
SNC          $> \log^{O(1)}(T(N))$             $= T(N)^{O(1)}$                            No
SP           $= O(T(N)^{\varepsilon})$         $= T(N)^{O(1)}$                            Yes

Table 5.2: Closure under the restartable fail-stop transformation $\rho$ (for $P = P(N)$).

The closure properties of the classes of [63] under the restartable fail-stop
transformation are summarized in Table 5.2.


Chapter  6 

Simplifying Memory Assumptions

In all our algorithms we assume that the shared memory is in a known state (i.e.,
contains zeros) prior to the very first execution of a Write-All algorithm, and that
the writes of a logarithmic number of bits are atomic. In this chapter we formally relax
these model requirements.

6.1  Solving  Write-All  Using  Contaminated  Memory 

As we have shown in Chapter 5, the Write-All problem (using P processors, write
1's into all locations of an array of size N, where $P \le N$) can be and has been used as the
basic building block for constructing efficient and fault-tolerant parallel algorithms. All
previous Write-All solutions use $\Omega(P)$ auxiliary shared memory and assume that this
memory is cleared or initialized to some known value. When Write-All building blocks
are used in polylogarithmic parallel time algorithms (e.g., to compute prefix sums or
list ranking), auxiliary memory initialization cannot be amortized over the computation.
Thus, assuming clear memory is a very strong precondition, and for Write-All itself
raises a legitimate “chicken-or-egg” objection.

In this section, using a deterministic bootstrapping and balancing argument, we show
how to solve Write-All when auxiliary memory is contaminated with arbitrary values. For
any dynamic pattern of fail-stop, no-restart errors on a CRCW PRAM with at least one
surviving processor, our new algorithm writes all 1's using $O(N + P \log^3 N/(\log\log N)^2)$
work, without any initialization assumption. This technique can be combined with
any Write-All algorithm to yield efficient simulations of any PRAM, and even optimal
simulations given processor slack. It can also be used with restartable fail-stop processor
simulations. In addition, we show that for the parallel prefix computation it is possible
to improve on the best deterministic simulations to date: by a factor of log N when the
memory is clear, and by a factor of log log N when the memory is contaminated.

6.1.1  Write-All  assumptions 

Write-All captures the computational progress that can be naturally accomplished in
unit time by a PRAM (when P = N). In the presence of asynchrony or failures, efficient
solutions to Write-All (increasing the fault-free work by polylogarithmic factors only)
are non-obvious. Note that in all existing solutions the initial state of the size N array
does not matter. For example, up to now we assumed it is all 0's, but
the algorithms would work even if the N locations were initialized using arbitrary 0's
and 1's. A much more important assumption in all previous Write-All solutions (both
in this thesis and by other authors, e.g., [29, 59, 61, 75]) concerned the initial state
of the additional auxiliary memory used (typically of Ω(P) size). The basic assumption has
been that:

The Ω(P) auxiliary shared memory is cleared or initialized to some known value.

In theory, this is a natural, even if unstated, assumption for PRAMs [44] and RAMs
(cf. Turing Machine auxiliary tapes, which are initially blank). However, given the definition of
Write-All, this dependence on clear space raises a legitimate “chicken-or-egg” objection.
In practice, memory locations typically contain unpredictable values, and processes that
need to use large blocks of memory cannot assume that it is cleared or initialized to
a known value. In fact, operating systems usually provide explicit services that allocate
clear memory, e.g., calloc() in standard C libraries. Such allocation is predictably much
more time consuming, even in the absence of failures.

It is easy to construct simple Write-All algorithms that do not assume clear shared
memory, but they appear to use quadratic work. If the overall computation involves
many steps, one can perhaps afford an expensive initialization phase and amortize its
cost over subsequent efficient steps. Unfortunately, when Write-All building blocks are
used in very fast (i.e., polylogarithmic parallel time) algorithms (e.g., to compute prefix
sums or list ranking), auxiliary memory initialization cannot be amortized over the
computation. Fortunately, we show that there is a way around this dilemma:

We present Write-All algorithms and algorithm simulations that do not require
that the auxiliary memory is cleared prior to the computation.

Algorithms  in  this  setting  have  some  similarities  with  the  notion  of  a  self-stabilizing 
system  introduced  by  Dijkstra  in  [34].  Paraphrasing  [34],  a  system  is  self-stabilizing  if 
and  only  if,  regardless  of  the  initial  state  the  system  can  always  make  a  state  transition 
into  another  state,  and  the  system  is  guaranteed  to  find  itself  in  a  legitimate  state  after 
a  finite  number  of  transitions.  Our  computations  using  initially  contaminated  memory 
can  be  viewed  as  self-stabilizing  with  respect  to  the  state  of  shared  memory. 

We eliminate the assumption that any amount of clear initial memory is available
for the fail-stop and fail-stop restartable algorithms. We develop deterministic
fault-tolerant algorithms that can be used to simulate PRAMs using contaminated memory,
i.e., when the shared memory not containing the input is initially in an arbitrary and
possibly illegal state. We also improve on the state-of-the-art robust prefix sums
computations.

6.1.2 Model: fail-stop PRAM with contaminated memory

The  basis  of  our  model  is  the  restartable  fail-stop  CRCW  PRAM  of  Sections  2.5-2.6 
except  that  the  shared  memory  that  does  not  contain  the  input  is  contaminated: 

1.  There  are  P  processors.  Each  has  a  unique  processor  identifier  PID  in  the  range 

2. Shared memory is accessible to all processors; each processor has a constant size
private memory. Each memory cell stores one word of size O(log max{N, P}).

3.  The  input  is  stored  in  N  cells  in  shared  memory. 

4.  The  shared  memory  not  containing  the  input  is  contaminated. 

We use the notation “Write-All(N, P, L)” to stand for an instance of fault-tolerant
Write-All that uses P processors and clear auxiliary memory of size L to initialize to 1
an array of size N.

Definition 6.1 An algorithm that uses P processors to solve a Write-All problem of
size N is contamination-tolerant if it is a Write-All(N, P, 0) algorithm. □

6.1.3 Write-All algorithms using contaminated memory

The Write-All algorithms and simulations based on the Write-All paradigm, e.g., [55, 59, 61,
92], or the algorithms that can serve as Write-All solutions, e.g., the addition algorithm
in [29] or the maximum finding algorithm in [75], invariably assume that a linear portion
of shared memory is either cleared or initialized to known values. Starting with a
non-contaminated portion of memory, such algorithms and simulations are able to perform
their computation by “using up” the clear memory, and concurrently or subsequently
clearing additional segments of memory needed for future iterations. We develop an
efficient Write-All solution that requires no clear shared memory.

A  Bootstrap  procedure 

We formulate a bootstrap approach to the design of fault-tolerant Write-All algorithms
for the case where the auxiliary memory is initially contaminated. The bootstrapping
proceeds in stages:

In stage 0 of our procedure, all P processors clear an initial segment of $N_0$ locations
in the auxiliary memory.

At stage i ≥ 1 of the procedure, we use P processors to clear $N_i$ memory locations
with the help of the $N_{i-1}$ memory locations that were cleared in stage i − 1.

If $N_{i+1} > N_i$ and $N_0 \ge 1$, then this procedure will clear the required N memory
locations in at most N stages. Let τ be the final stage number, i.e., $N_\tau = N$.

Let $P_i$ be the number of active processors that initiate stage i, and define $N_{-1} = 0$.
The cost of such a procedure is $S_{boot} = \sum_{i=0}^{\tau} S_i(N_i, P_i, N_{i-1})$, where $S_i$ is the cost of
the Write-All($N_i$, $P_i$, $N_{i-1}$) algorithm used in stage i.

The efficiency of the resulting algorithm depends on the choices of the particular
Write-All solution(s) used in each stage and on the parameters $N_i$.

One specific approach is to define a series of multipliers $G_0, G_1, \ldots, G_\tau$ such that
$N_i = \prod_{j=0}^{i} G_j$. The high level view of such an algorithm is given in Figure 6.1. The

01  forall processors PID=0..P − 1 parbegin   -- P processors clear N memory
02    Clear the initial block of N_0 = G_0 elements sequentially using P processors
03    i := 0                                   -- iteration counter
04    while N_i < N do
05      Use a Write-All solution with data structures of size N_i
06        and G_{i+1} elements at the leaves
07        to clear memory of size N_{i+1} = N_i · G_{i+1}
08      i := i + 1
09    od
10  parend


Figure  6.1:  A  high  level  view  of  the  bootstrap  algorithm. 


algorithm consists of an initialization (lines 02-03) and a parallel loop (lines 04-09). We
use a variation of this scheme below.
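The growth of the cleared region in Figure 6.1 can be made concrete with a small sketch. The function name `bootstrap_schedule`, the use of a single fixed multiplier `g` for every stage, and the capping of the last stage at N are assumptions made for illustration only; the sketch computes the sizes $N_0, N_1, \ldots$ and does not model processors or failures.

```python
import math


def bootstrap_schedule(n, g=None):
    """Return the sizes [N_0, N_1, ...] of the segments cleared by each stage.

    Assumption: every stage uses the same multiplier g (default: ceil(log2 n)),
    and the last stage is capped at n. Both choices are illustrative.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    if g is None:
        g = max(2, math.ceil(math.log2(n)))
    if g < 2:
        raise ValueError("the multiplier must be at least 2 for the sizes to grow")
    sizes = [min(g, n)]                       # stage 0: cleared sequentially
    while sizes[-1] < n:                      # loop of lines 04-09 in Figure 6.1
        sizes.append(min(sizes[-1] * g, n))   # N_{i+1} = N_i * G_{i+1}
    return sizes


if __name__ == "__main__":
    print(bootstrap_schedule(1024))
```

With a multiplier of about log N the number of stages is roughly log N / log log N, which is the stage count used in the analyses that follow.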


We next use the bootstrap approach to construct and analyze contamination-tolerant
Write-All algorithms in the fail-stop and restartable fail-stop models.


Algorithm  Z  for  the  fail-stop  model 

We will use algorithm W in each stage of the bootstrap procedure, and we call the resulting
algorithm algorithm Z.

We analyze algorithm Z for the following choice of parameters: we use $G_0 = \log N$
and $G_i = G_{i-1} = \log N$ (for i > 0), so that $N_i = \log^{i+1} N$. In the initialization, all P
processors traverse a list of size $G_0$ sequentially and clear it. Then, iteratively, the
processors use algorithm W to clear increasingly larger sections of memory using the
auxiliary memory cleared in the previous iteration (Fig. 6.1, lines 05-07).

Recall that algorithm W is a fail-stop (no restart) Write-All solution. It uses two
full binary trees (represented as heaps in memory) and it consists of a loop in which
the active processors synchronously iterate through four tree traversal phases. To avoid
a complete restatement, the reader is urged to refer to Section 3.2.1. When we use a
parameterized algorithm W, with the result of Martel (Appendix A.6), the work of the
algorithm is (similarly to Lemma 3.6):

Theorem 6.1 Algorithm W with P processors, the progress tree with H leaves (P ≤ H)
and 2H − 1 total nodes all initialized to zero, and G array elements at each leaf, has
the work of $S = O((H + P \log H/\log\log H) \cdot (\log P + \log H + G))$ for any pattern of
stop-failures.

Note  that  the  above  result  and  algorithm  W  can  be  used  when  P  >  H .  As  we  have 
already  described  in  Section  3.2.1,  when  there  are  P  processors  and  the  progress  tree 
has  H  <  P  leaves,  then  it  is  sufficient  for  each  processor  to  take  its  PID  modulo  H  to 
assure  uniform  initial  assignment  of  processors  and  to  preserve  the  result. 
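The modulo assignment can be checked directly. In the sketch below, the function name `leaf_loads` and the 0-based numbering of PIDs and leaves are assumptions for illustration; it counts how many processors start at each leaf when P > H.

```python
from collections import Counter


def leaf_loads(p, h):
    """Processors per leaf when processor pid starts at leaf (pid mod h).

    Assumption: PIDs are 0..p-1 and leaves are numbered 0..h-1.
    """
    loads = Counter(pid % h for pid in range(p))
    return [loads[leaf] for leaf in range(h)]


if __name__ == "__main__":
    print(leaf_loads(10, 4))
```

The loads differ by at most one, and no leaf receives more than ⌈P/H⌉ processors, which is the sense in which the initial assignment is uniform.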

Algorithm  W  stores  its  binary  trees  as  linear  arrays  interpreted  as  heaps.  Therefore 
the  structure  of  the  trees  is  unaffected  by  the  state  of  the  memory,  because  the  heaps 
are  implicit.  We  next  observe  that  the  enumeration  of  the  processors  in  phase  W1  of 
algorithm  W  can  be  done  in  a  bottom-up  traversal  of  a  contaminated  processor  tree. 
The  pseudocode  for  this  algorithm  is  given  in  Figure  6.2. 

We call this algorithm $Z_{enum}$. The surviving processors enumerate themselves using
a standard logarithmic time algorithm based on addition. The contaminated memory
cells are distinguished from the cells that contain valid values via the use of a single bit
associated with each cell (a so-called “deadman flag”). When a processor arrives at a
node, it clears the bit associated with its sibling, then it sets its own bit (the two
assignments to alive in Figure 6.2, lines 16-17). Only cells that have valid values written
in them by active processors will have the bit set. The enumeration itself is as in phase W1.

Theorem 6.2 Algorithm Z is a contamination-tolerant Write-All(N, P, 0) algorithm
that for any pattern of fail-stop errors has $S = O(N + P \log^3 N/(\log\log N)^2)$ for
1 ≤ P ≤ N.

Proof: We first evaluate and then total the work of the algorithm during each of
the finitely many stages of its execution. In each use of algorithm W, we will have
G = log N as the number of memory locations associated with each leaf of the progress
tree, and we will apply Thm 6.1 with different instantiations of H to evaluate the upper
bound on work.

Stage 0: Enumerate processors using $Z_{enum}$, then sequentially clear log N memory
locations using all surviving processors. The work using the initial $P_0 \le P$ processors
is: $W_0 = P_0 \cdot \log P + P_0 \cdot \log N$.


forall processors PID = 0..P − 1 parbegin
  shared integer array c[1..2N − 1];    -- processor counts
  shared bit array alive[1..2N − 1];    -- alive/dead markers
  private integer pn;                   -- enumerated processor number
  private integer j1, j2,               -- left/right sibling indices
                  t;                    -- predecessor index of j1 and j2
  j1 := PID + (N − 1);                  -- heap-leaf init
  pn := 1;                              -- assume this processor is no. 1
  c[j1] := 1;                           -- a processor is counted once in this step
  for 1..log(P) do                      -- traverse the tree from leaf to root
    t := j1 div 2;                      -- parent of j1 and j2
    if 2 * t = j1
    then j2 := j1 + 1                   -- j1 came from left
    else j2 := j1 − 1                   -- j1 came from right
    fi;
    alive[j2] := 0                      -- mark sibling dead
    alive[j1] := 1                      -- mark self alive
    if alive[j2] = 1                    -- both sub-trees have active processors?
    then c[t] := c[j1] + c[j2]          -- both branches are active
      if j1 > j2                        -- j1 came from right, update processor number
      then pn := pn + c[j2]
      fi
    else c[t] := c[j1]                  -- all siblings failed
    fi;
    j1 := t                             -- advance up the heap
  od
parend

Figure 6.2: Contamination robust processor enumeration $Z_{enum}$.
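The deadman-flag idea can be exercised with a small sequential simulation of the synchronous rounds. This is a sketch under stated assumptions, not the thesis's implementation: the function name `enumerate_survivors` is hypothetical, N is taken to be a power of two with P = N, PIDs are 0-based with processor pid starting at heap leaf N + pid, the set of survivors is fixed for the whole enumeration (no failures during the traversal), and both shared arrays are pre-filled with pseudo-random garbage to model contamination.

```python
import random


def enumerate_survivors(n, survivors, seed=0):
    """Rank the surviving processors 1..k using contaminated shared arrays.

    Returns (rank, total): rank maps each surviving pid to its number,
    total is the count left at the root of the heap.
    """
    rng = random.Random(seed)
    c = [rng.randrange(1000) for _ in range(2 * n)]   # contaminated counts
    alive = [rng.randrange(2) for _ in range(2 * n)]  # contaminated flags
    pos = {pid: n + pid for pid in survivors}         # current heap node per processor
    rank = {pid: 1 for pid in survivors}
    for j1 in pos.values():
        c[j1] = 1                                     # each processor counts itself once
    for _ in range(n.bit_length() - 1):               # log2(n) rounds, leaf to root
        for j1 in pos.values():
            alive[j1 ^ 1] = 0                         # all processors: mark sibling dead
        for j1 in pos.values():
            alive[j1] = 1                             # then all processors: mark self alive
        updates = {}
        for pid, j1 in pos.items():
            j2, t = j1 ^ 1, j1 // 2                   # sibling and parent in the heap
            if alive[j2] == 1:                        # sibling subtree has a live processor
                updates[t] = c[j1] + c[j2]
                if j1 > j2:                           # came from the right: skip the left count
                    rank[pid] += c[j2]
            else:                                     # sibling cell is garbage or dead
                updates[t] = c[j1]
        for t, value in updates.items():
            c[t] = value
        pos = {pid: j1 // 2 for pid, j1 in pos.items()}
    return rank, c[1]


if __name__ == "__main__":
    print(enumerate_survivors(8, {1, 4, 5, 7}))
```

Because every live processor first clears its sibling's flag and only then sets its own, a flag equal to 1 after both writes can only have been written in the current round, so the garbage initially stored in c and alive never reaches a count.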


Stage 1: $P_1 \le P_0 \le P$. Using the instance of Thm 6.1 where $H = \log N$, the work is:

$W_1 = (\log N + P_1 \log\log N/\log\log\log N) \cdot (\log P_1 + \log N + \log\log N)$.

Stage i: $P_i \le P_{i-1} \le N$. Using the instance where $H = \log^i N$:

$W_i = (\log^i N + P_i \cdot i \log\log N/(\log i + \log\log\log N)) \cdot (\log P_i + \log N + i \log\log N)$.

The final stage τ is when $\log^{\tau} N = N/\log N$, i.e., $\tau = \log N/\log\log N - 1$.

Totalling the work in all stages yields:

$S = \sum_{i=0}^{\tau} W_i = W_0 + \sum_{i=1}^{\tau} \left(\log^i N + P_i \frac{i \log\log N}{\log i + \log\log\log N}\right) (\log P_i + \log N + i \log\log N)$.

Simplifying the sum results in $S = O(N + P \log^3 N/(\log\log N)^2)$. □


This approach has the following range of optimality:

Theorem 6.3 For any pattern of fail-stop errors, algorithm Z is a contamination-tolerant
Write-All(N, $N(\log\log N)^2/\log^3 N$, 0) algorithm with S = O(N).

Algorithm  Zr  for  the  restartable  fail-stop  model 

Algorithm Zr is similar to algorithm Z, except that in each stage we will be utilizing
a restartable Write-All algorithm. (Algorithm W is not suitable when restarts are
allowed.) Other parameters of the bootstrap procedure are the same as for the fail-stop
case.

In this analysis, we will be using an algorithm that was described and characterized
with the following result by Anderson and Woll:

Theorem 6.4 [8] There exists a Write-All(H, H, H) solution with H processors that
has work $O(H^{1+\varepsilon})$ for every ε > 0.

This is an existential result, and we call this algorithm AW. The best known
constructive deterministic algorithm, with $\varepsilon = \log_2 3 - 1 < 0.59$, is algorithm X (it can
also be used with the bootstrap). Note that algorithm AW was developed for the
asynchronous model, but it can be used in the restartable fail-stop model as well. The
work of the algorithm in the asynchronous model is the same as its completed work in
the restartable fail-stop model.

Theorem 6.5 Algorithm Zr is a contamination-tolerant Write-All(N, N, 0) algorithm
that for any pattern of fail-stop errors has $S = O(N^{1+\varepsilon})$ for any ε > 0.

Proof: We first note that there exists a Write-All(H, P, H) solution with P ≥ H
processors that has work $O(P^{1+\varepsilon})$ for every ε > 0. We use algorithm AW, except all
processors use their PIDs modulo H. The worst case work is achieved when up to
$\lceil P/H \rceil$ processors that have the same PID modulo H operate synchronously as a single
processor. The work of the algorithm in this case is: $S = \lceil P/H \rceil \cdot O(H^{1+\varepsilon}) = O(P^{1+\varepsilon})$.

Using  this  algorithm  at  each  stage  of  the  bootstrap  procedure,  and  evaluating  the  total 
work  as  in  Thm  6.2  yields  the  desired  result: 

We evaluate and then sum the work of the algorithm during each of the finitely
many stages of its execution. In each stage i ≥ 1 of algorithm Zr, we will use
algorithm AW log N times to clear $\log^{i+1} N$ memory locations. In each instance of use
of Theorem 6.4, we will use δ > 0 as the exponent, such that δ = ε/2. This is done to
simplify the final sum using the property that $\log N = O(N^{\delta})$ for any δ > 0. We also
use P = N for clarity.

Stage 0: All processors linearly initialize the segment of shared memory of length log N.
The work is: $W_0 = P \cdot \log N$.

Stage 1: The algorithm is applied log N times to clear a segment of shared memory of
size $\log^2 N$. Using the instance where $H = \log N$, the work is: $W_1 = (P \log^{\delta} N) \cdot \log N$.

Stage i: Using the instance $H = \log^i N$: $W_i = (P (\log^i N)^{\delta}) \cdot \log N = (P \log^{i\delta} N) \cdot \log N$.

Final stage τ, where $\log^{\tau} N = N/\log N$, i.e., $\tau = \log N/\log\log N - 1$. Using the
instance where $H = \log^{\tau} N = N/\log N$, the work is: $W_\tau = (P (\log^{\tau} N)^{\delta}) \cdot \log N =
(P (N/\log N)^{\delta}) \cdot \log N = P \cdot N^{\delta} \log^{1-\delta} N$.

$S = \sum_{i=0}^{\tau} W_i = W_0 + \sum_{i=1}^{\tau} (P \log^{i\delta} N) \cdot \log N = O(N^{1+\delta} \log^{1-\delta} N)
= O(N^{1+\delta} \log N) = O(N^{1+\varepsilon})$.

□
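The step from the sum over stages to the bound $O(N^{1+\delta} \log^{1-\delta} N)$ is a geometric-series estimate. The following worked step is a sketch of that estimate, using P = N as in the proof and assuming N is large enough that $x = \log^{\delta} N \ge 2$:

```latex
\begin{align*}
\sum_{i=1}^{\tau} \log^{i\delta} N
  &= \sum_{i=1}^{\tau} x^{i}
   = \frac{x^{\tau+1} - x}{x - 1}
   \le \frac{x}{x-1}\, x^{\tau}
   \le 2\, x^{\tau}, \\
x^{\tau} &= \left(\log^{\tau} N\right)^{\delta}
          = \left(\frac{N}{\log N}\right)^{\delta}, \\
S &\le N \log N + 2\, N \log N \left(\frac{N}{\log N}\right)^{\delta}
   = O\!\left(N^{1+\delta} \log^{1-\delta} N\right).
\end{align*}
```

The last stage therefore dominates the total, and bounding the remaining logarithmic factor by $O(N^{\delta})$ gives $O(N^{1+2\delta}) = O(N^{1+\varepsilon})$.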


6.1.4  General  simulations  and  algorithm  transformations 

Using  the  contamination-tolerant  Write-All  solutions  we  have  developed  for  the  fail-stop 
no-restart  and  fail-stop  restartable  models,  we  obtain  the  general  simulation  results  and 
some  improvements  for  algorithm  transformations. 

Oblivious  simulations 

For  the  setting  with  initially  contaminated  shared  memory,  using  algorithms  Z  and 
Zr  with  the  general  algorithm  simulation  techniques  from  Chapter  5,  we  obtain  the 
following  results: 

Theorem 6.6 Any N-processor, τ parallel time PRAM algorithm can be simulated
using O(N) contaminated memory and P fail-stop CRCW processors with

$S = O(N + P \log^3 N/(\log\log N)^2 + \tau \cdot P \log^2 N/\log\log N)$ for 1 ≤ P ≤ N.

This simulation has optimal ranges:

Corollary 6.7 Any N-processor, τ parallel time PRAM algorithm can be simulated
using O(N) contaminated memory and P fail-stop CRCW processors with $S = O(\tau \cdot N)$
when:

1. $1 \le P \le N(\log\log N)^2/\log^3 N$, or

2. $1 \le P \le N \log\log N/\log^2 N$ and $\tau \ge \log N/\log\log N$.
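The two ranges can be checked by substitution. The worked step below is a sketch that assumes the bound of Theorem 6.6 reads $S = O(N + P \log^3 N/(\log\log N)^2 + \tau P \log^2 N/\log\log N)$ and that τ ≥ 1:

```latex
\begin{align*}
\text{(1)}\quad P \le \frac{N(\log\log N)^2}{\log^3 N}
  &\;\Rightarrow\;
  \frac{P \log^3 N}{(\log\log N)^2} \le N,
  \qquad
  \frac{\tau P \log^2 N}{\log\log N} \le \frac{\tau N \log\log N}{\log N} \le \tau N; \\
\text{(2)}\quad P \le \frac{N \log\log N}{\log^2 N},\
  \tau \ge \frac{\log N}{\log\log N}
  &\;\Rightarrow\;
  \frac{\tau P \log^2 N}{\log\log N} \le \tau N,
  \qquad
  \frac{P \log^3 N}{(\log\log N)^2} \le \frac{N \log N}{\log\log N} \le \tau N.
\end{align*}
```

In both cases every term of the bound is $O(\tau \cdot N)$, which matches the work of the simulated algorithm.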

In the restartable fail-stop model we get:

Theorem 6.8 Any N-processor, τ parallel time PRAM algorithm can be simulated
using O(N) contaminated memory and N restartable fail-stop CRCW processors with
$S = O(\tau \cdot N^{1+\varepsilon})$.

Remark 6.1 We can also use the complexity measure of overhead ratio σ to evaluate
the efficiency of simulations by amortizing the work of a simulation over the
necessary work and the number of failures that are encountered. The simulation in the
restartable fail-stop model has overhead ratio per PRAM step of $\sigma = N^{\varepsilon}$. This overhead
ratio can be made polylogarithmic by interleaving algorithm Zr with algorithm V as
presented in Section 5.4.

Improving  oblivious  simulations 

As we have discussed in Chapter 5, custom transformations of algorithms are interesting
because in some cases it is possible to improve on the work of the naive oblivious
simulation. These improvements are most significant for fast algorithms when a full
range of processors is used. In the case of parallel prefix (Section 5.5.1), additional
savings can be carried over to the fail-stop no-restart model in the setting when the
shared memory is contaminated. Using the result of Theorem 5.9 together with a
contamination-tolerant Write-All solution we obtain the following:

Theorem 6.9 Parallel prefix for N values can be computed using N fail-stop processors
and O(N) contaminated memory with $S = O(N \log^3 N/(\log\log N)^2)$.

Note that using N processors to simulate a parallel prefix algorithm that uses P = N
processors and time log N would require the work $S = O(N \log^3 N/\log\log N)$
(Theorem 6.6), and so the custom algorithm saves a log log N factor relative to the oblivious
simulation.


6.2  Atomic  Access  and  Word  Size 

Thus far, we relied on the property of our model that parallel writes of log N-bit words
are performed atomically. That is, the model assumes the following: (1) log N-bit words are
written in unit time, and (2) the adversary can cause failures either before or after the
write cycle of the PRAM, but not during the write cycle. The fault-tolerant algorithms
we developed can be modified so that these two restrictions are relaxed.

The new definition of atomicity becomes:

(1) log N-size words are written using log N bit write cycles, and

(2) the adversary can cause arbitrary fail-stop errors either before or after the single
bit write cycle of the PRAM, but not during the bit write cycle.

Proposition 6.1 Any fault-tolerant algorithm using O(log N) bit atomic writes on
inputs I of size N, and using P processors for 1 ≤ P ≤ N, can be adapted to use O(1)
bit atomic writes so that there is: (1) preservation of the O(S(I, F, P)) steps used by the
algorithm (counting log N bit write cycles as one time unit), and (2) preservation of the
space used (counting O(log N) bits as one word).

Proof: The algorithms are adapted by simulating one atomic write of an O(log N) bit
word using O(log N) atomic O(1) bit writes. We implement log N-size
words using a single bit tag and two log N-size words. The two words are numbered 0
and 1, and the bit tag (initially 0) indicates which of the two words has valid contents.

Thus each shared memory location is represented as:

record
  bit t;               -- current valid version number
  integer X[0..1];     -- log N-size values indexed by t
end

Each read cycle of the shared memory now becomes:

begin  -- macro read cycle
  read tag from t;                         -- read current tag
  for i = 1 to log(N) do                   -- read current contents
    read bit i of value from bit i of X[tag];
  od
end

The write cycle to the shared memory becomes:

begin  -- macro write cycle
  read tag from t;                         -- read current tag
  tag := tag + 1 (mod 2);
  for i = 1 to log(N) do                   -- write new contents
    write bit i of value to bit i of X[tag];
  od
  write tag to t;                          -- update the tag
end

Since the single bit tag is the last bit written during the write cycle, a failure
anywhere during this high level write cycle will prevent the tag value from being updated,
and so any subsequent read will be able to read the previous value stored. This approach
is similar to that of Bloom in [24], but it is somewhat simpler due to the fact that we
are dealing with the synchronous model.
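The macro read and write cycles can be illustrated with a short single-writer sketch. The class name `TaggedWord` and the `fail_after` parameter (the number of single-bit writes completed before the writing processor stops) are hypothetical names introduced only for this illustration; the sketch models one processor writing one word and does not model concurrent writers.

```python
class TaggedWord:
    """One shared-memory word: a tag bit t and two versions X[0], X[1]."""

    def __init__(self, bits, value=0):
        self.bits = bits
        self.t = 0                                          # tag: index of the valid version
        self.x = [[(value >> i) & 1 for i in range(bits)],  # version 0 holds the value
                  [0] * bits]                               # version 1 is scratch space

    def read(self):
        """Macro read cycle: read the tag, then the bits of the valid version."""
        tag = self.t
        return sum(bit << i for i, bit in enumerate(self.x[tag]))

    def write(self, value, fail_after=None):
        """Macro write cycle; returns True if the tag was updated.

        fail_after (an assumption of this sketch) stops the processor once that
        many single-bit writes have completed; any value in 0..bits stops it
        before the tag write.
        """
        tag = (self.t + 1) % 2                              # write into the invalid version
        done = 0
        for i in range(self.bits):
            if fail_after is not None and done >= fail_after:
                return False                                # fail-stop in mid-word
            self.x[tag][i] = (value >> i) & 1
            done += 1
        if fail_after is not None and done >= fail_after:
            return False                                    # fail-stop just before the tag write
        self.t = tag                                        # single atomic bit write
        return True


if __name__ == "__main__":
    w = TaggedWord(8, 0xA5)
    w.write(0x3C, fail_after=5)
    print(hex(w.read()))
```

A write interrupted at any bit leaves the tag pointing at the old version, so a later read returns the previous value in full; only the final one-bit tag write makes the new value visible.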

Fault-tolerant algorithms can be automatically transformed using the macro read
and write cycles above to versions that only require single bit atomic writes. Clearly,
the number of log N-size words read or written by each macro cycle is O(1) as before,
and the shared memory requirements are within a factor of two of the original memory
size. Therefore, the asymptotic performance of the algorithm has not changed. □

Remark 6.2 It is sufficient to use non-atomic log N-size word reads/writes instead of
the log N single bit reads/writes. Thus the simplest atomicity requirement is that the
write of the single bit tag must be atomic for each write of a single log N-bit word.

Remark 6.3 This approach is consistent with the restartable fail-stop PRAM,
since synchronous restarts cannot occur in the middle of a read or a write of a word. In
that setting, the above simulation can be used as is with restartable processors.


Chapter  7 

Discussion  and  Open  Problems 


We presented a study of fault tolerance and efficiency for two models of fault-prone
parallel computation: the fail-stop no-restart PRAM and the restartable fail-stop
PRAM. Both models are direct extensions of the standard PRAM model, so that all
existing algorithms can be executed on either model without any changes in the
absence of failures. When failures are introduced or when the shared memory is initially
contaminated, existing parallel algorithms can be mechanically transformed so that the
algorithms become fault-tolerant while the efficiency is degraded only slightly for a large
family of failure patterns. Furthermore, we can take advantage of parallel slack. In
the fail-stop no-restart model, we can simulate algorithms so that the work is preserved
asymptotically. In the restartable model the work is preserved as long as the number
of failures per processor is logarithmic per simulated step.

The area of efficient and fault-tolerant parallel computation remains a fertile ground
for further research and improvements of the existing results:

Model building: The definitions of the two models of computation that we studied,
the fail-stop PRAM and the restartable fail-stop PRAM, are obtained by combining:
(i) a model of parallel processing (i.e., shared-memory based multiprocessing),
and (ii) an associated model of failures (e.g., failure types, strength of adversaries,
granularity of failures and frequency). In order to evaluate the efficiency of various
fault-tolerant algorithmic methods for the defined models, we have defined
new and generalized existing measures of complexity (i.e., overhead ratio, available
processor steps and overall work). We have also shown that the fail-stop CREW
PRAM does not admit efficient solutions to the Write-All problem.

Productive and promising research areas include identifying further natural models,
with the goal of classifying the models that either (a) admit efficient fault-tolerant
algorithms, or (b) are inherently prohibitive to efficient computation.

For  the  update  cycles  that  we  use  in  the  restartable  model,  it  is  interesting  to 
determine  the  minimum  number  of  reads  and  writes  necessary  to  enable  the  exis¬ 
tence  of  efficient  algorithms.  Other  questions  of  merit  include:  What  is  the  precise 
relationship  between  the  complexity  of  problems  (as  opposed  to  algorithms)  on  the 
two  models  presented  here?  Finally,  are  there  efficient  algorithms  for  important 
problems  that  can  be  derived  independently  and  do  not  come  from  simulation  or 
transformation  of  synchronous  PRAM  algorithms? 

Algorithms and upper bounds: We designed and analyzed several efficient and
fault-tolerant algorithms for the parallel models studied.

The design of even more efficient algorithms subject to the constraints of efficiency,
reliability, scalability and feasibility remains a challenging topic for research. More
efficient algorithms could be developed for the Write-All problem that serves as the
basis for general algorithm simulations, and also in order to improve the efficiency
of naive general simulations. There is still a log N/log log N gap that remains
between the most efficient known deterministic Write-All algorithm, i.e., W, and
the corresponding lower bound, i.e., Ω(N log N) ([61]).

Another open problem is to determine the overhead ratio σ for algorithm X in
the original setting of failures and restarts.

Recently, an existence proof for an algorithm achieving work $O(N^{1+\varepsilon})$ was given
in [8]. Is $O(N \log^{O(1)} N)$ completed work for solving Write-All with N processors
and input of size N achievable in the restartable fail-stop model?

Lower bounds: We have shown an Ω(N log N) lower bound (when N = P) for the
Write-All problem in the restartable fail-stop model under the assumption that
processors can read and locally process the entire shared memory at unit cost,
i.e., the memory snapshot assumption. Under this assumption, this is the best
possible lower bound.




Under the same assumption we showed an Ω(N log N/log log N) lower bound for
the no-restart fail-stop model, and that this is the best possible bound.

In order to further improve these bounds, the strong assumption of memory
snapshots must be removed. There has been some progress in this direction: under different
assumptions, an Ω(N log N) lower bound is shown for failures without restarts in
[61].

Can  these  lower  bounds  be  further  improved?  For  the  no-restart  fail-stop  model, 
the  improvements  can  be  modest  at  best,  since  only  a  log  N/  log  log  N  gap  remains 
between  the  upper  and  lower  bounds. 

Experimental analysis: The design of new and the analysis of existing fault-tolerant
parallel algorithms can be aided by using experimentation. Algorithm animation
[25, 95] has the promise of providing additional insights into algorithms' behavior
through visualization. A tool for animating Write-All algorithms was developed
by Apgar [9] using Stasko's TANGO system [95]. Using that animation, an
observer can monitor the progress of a parallel computation and dynamically inject
processor faults and restarts.

Concluding  remarks 

It  is  often  claimed  that  distributed  computing  systems  have  the  potential  advantage  of 
higher  reliability  over  centralized  systems.  This  advantage  of  distributed  computing  can 
be  applied  also  to  parallel  systems,  because  the  fault-tolerance  in  distributed  systems  is 
precisely  due  to  the  replication  of  resources.  The  resulting  redundancy  in  computation 
is  a  trade-off  of  efficiency  (measured  as  all  available  resources)  for  fault-tolerance. 

The fundamental question we studied in the context of parallel algorithms depends
on the exact form of this trade-off. In summary:

How  can  the  reliability  advantage  of  distributed  computing  be  combined 
with  the  speed-up  potential  of  parallel  computing? 

Many  interesting  related  work  areas  can  be  defined,  and  we  believe  that  there  are 
worthy  research  areas  beyond  the  immediate  results  of  this  thesis.  Our  framework 
may  be  generalized  in  many  ways  based  on  assumptions  about  system  architecture  and 
structure  of  faults.  Some  concrete  and  relevant  questions  that  seem  promising  follow: 
1. Formulate a common framework for merging our results on fault-tolerant parallelism
and the existing results on distributed network protocols in the presence of
topological changes.

2.  What  about  randomized  robust  parallel  computation?  Could  a  formalization  along 
the  lines  of  [70]  be  used?  Another  good  area  here  seems  to  be  the  average  case 
analysis  of  fail-stop  CREW  processors  [73]. 

3. Evaluate the feasibility of implementing fault-tolerant algorithms based on different
multiprocessing paradigms.

4. In actual multiprocessor practice, threads packages provide a basis for the implementation
of a wide variety of parallel paradigms, e.g., [22, 36]. The available
threads packages typically support shared-memory lightweight processes. How are
parallel programs, implemented using threads packages, affected by processor failures?
How can fault-tolerance be built into the threads packages
themselves? What fault-tolerant programming methodologies can be designed for
the commonly used threads packages?

5. Fault-tolerant multiprocessor scheduling - develop the models and strategies for
efficient and fault-tolerant processor scheduling (e.g., the three processor allocation
paradigms) and concurrency control for the purposes of robust computation and
preservation of invariants of persistent data structures.

6. Robust multiprocessing software package - a practical goal could be to develop
a multiprocessing threads package based on the accumulated research and experimental
results.
Appendix A

Pseudocode  for  algorithm  W  and 
Two  Lemmas 


The first four sections of this appendix contain the detailed pseudocode for algorithm
W and brief comments. The fifth section contains a formal proof of Lemma 3.2. The
sixth and final section gives an alternative proof of Lemma 3.4 that was communicated by
Martel [71].
A.1 Main Procedure for Algorithm W


The main loop in Figure A.1 consists of the four phases outlined in Section 3.2.1. Processor
counting and enumeration is implemented as a static bottom up traversal in
procedure S_BU() in Appendix A.2, work assignment is done in a dynamic top down
traversal in procedure D_TD() in Appendix A.4, the work itself is a simple assignment
“x[k] := 1”, and the progress is measured via a dynamic bottom up traversal in procedure
D_BU() given in Appendix A.3. Parameter passing is by reference in all cases.


forall processors PID=1..N
parbegin
   shared integer array
      x[1..N],       -- input array
      c[1..2N-1],    -- processor counts
      cs[1..2N-1],   -- count step numbers
      d[1..2N-1],    -- progress/done tree
      a[1..2N-1];    -- accounted tree
   private integer
      pn,            -- dynamic processor no.
      k,             -- array index PID will be assigned to
      step;          -- time stamp

   step := 0;        -- initial processor counting step
   k := PID;         -- initially work data item PID
   x[k] := 1;        -- visit leaf
   D_BU(k);          -- measure progress

   -- Main loop
   while d[1] ≠ N do
      S_BU(PID,step,pn);   -- enumerate proc-s
      D_TD(pn,k);          -- assign work
      x[k] := 1;           -- do work: visit leaf
      D_BU(k);             -- measure progress
   od
parend ;


Figure A.1: Main procedure of algorithm W.
A.2 Static Bottom Up Traversal


This  procedure  is  given  in  Figure  A.2.  All  processors  traverse  heap  c  to  compute  the 
overestimate  of  the  number  of  processors  in  c[l],  and  each  processor  computes  its  pro¬ 
cessor  number  pn  that  is  used  in  the  work  assignment  phase.  The  heap  cs  is  used  to 
synchronize  processor  counting  across  multiple  calls  to  S_BU(). 


procedure S_BU(integer PID,   -- processor id
               integer step,  -- timestamp
               integer pn)    -- processor no.
   shared integer array
      c[1..2N-1],    -- processor counts
      cs[1..2N-1];   -- count step numbers
   private integer
      j1, j2,        -- siblings indices
      t;             -- parent of j1 and j2

   step := step + 1;      -- new time stamp
   j1 := PID + (N-1);     -- heap-leaf init
   pn := 1;               -- assume this processor is no. 1
   c[j1] := 1;
   cs[j1] := step;        -- count the processor once

   -- Traverse the tree from leaf to root
   for 1..log(N) do
      t := j1 div 2;                  -- parent of j1 and j2
      if 2*t = j1
      then j2 := j1 + 1               -- j1 came from left
      else j2 := j1 - 1               -- j1 came from right
      fi;
      if cs[j1] = cs[j2]              -- both sub-trees active?
      then c[t] := c[j1] + c[j2]      -- both active
           if j1 > j2                 -- j1 came from right
           then pn := pn + c[j2]
           fi
      else c[t] := c[j1]              -- all siblings failed
      fi;
      cs[t] := step;                  -- time stamp, and
      j1 := t                         -- advance up the heap
   od
end ;


Figure  A.2:  Phase  W1  procedure  —  Static  bottom  up  traversal 
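The enumeration phase can be illustrated with a small lock-step simulation. The following Python sketch is not part of the original pseudocode: the name `static_bottom_up`, the 1-based list layout with an unused slot 0, and the simplifying assumption that no processor fails during the traversal itself are choices made here for illustration only.

```python
def static_bottom_up(n, live_pids, step, c, cs):
    """Lock-step sketch of S_BU for the processors in live_pids (1-based PIDs).

    n must be a power of two; c and cs are shared heaps indexed 1..2n-1.
    Assumes (for illustration) that no processor fails during the traversal.
    Returns {pid: pn} with the dynamic processor numbers.
    """
    levels = n.bit_length() - 1          # log2(n) for a power of two
    pos = {}                             # current heap index j1 of each processor
    pn = {}                              # dynamic processor number of each processor
    for pid in live_pids:
        j1 = pid + (n - 1)               # heap leaf of this processor
        pos[pid], pn[pid] = j1, 1        # assume this processor is no. 1
        c[j1], cs[j1] = 1, step          # count the processor once
    for _ in range(levels):
        for pid in live_pids:
            j1 = pos[pid]
            t = j1 // 2                  # parent of j1 and its sibling j2
            j2 = j1 + 1 if 2 * t == j1 else j1 - 1
            if cs[j1] == cs[j2]:         # sibling subtree was counted in this step
                c[t] = c[j1] + c[j2]
                if j1 > j2:              # came from the right: skip the left count
                    pn[pid] += c[j2]
            else:                        # sibling subtree has no live processor
                c[t] = c[j1]
            cs[t] = step                 # time stamp the parent
            pos[pid] = t                 # advance up the heap
    return pn


n = 8
c, cs = [0] * (2 * n), [0] * (2 * n)
print(static_bottom_up(n, [1, 3, 4, 8], 1, c, cs), c[1])
```

Reusing the same `c` and `cs` lists with a larger `step` value shows why the time stamps are needed: stale counts left by processors that have since failed are ignored because their `cs` entries do not match the current step.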
A.3 Dynamic Bottom Up Traversal


This  procedure  is  given  in  Figure  A.3.  Heap  d  contains  the  underestimates  for  the 
number  of  leaves  visited  in  each  subtree,  with  d[l]  containing  the  underestimate  of  the 
total  number  of  leaves  visited.  This  number  is  used  in  terminating  the  overall  program 
(when  d[l]=N). 


procedure D_BU(integer k   -- current leaf
              )
   shared integer array
      d[1..2N-1];    -- done/progress tree
   private integer
      i1, i2,        -- siblings indices
      t;             -- parent of i1 and i2

   i1 := k + (N-1);  -- heap-leaf init.
   d[i1] := 1;       -- done for good

   -- Traverse the tree from leaf to root
   for 1..log(N) do
      t := i1 div 2;            -- parent of i1 and i2
      -- compute left/right indices
      if 2*t = i1
      then i2 := i1 + 1         -- i1 came from left
      else i2 := i1 - 1         -- i1 came from right
      fi;
      d[t] := d[i1] + d[i2];    -- update progress
      i1 := t                   -- advance to the predecessor
   od
end ;


Figure A.3: Phase W4 procedure - Dynamic bottom up traversal
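As an illustration of the progress measurement, here is a minimal Python sketch of the traversal for a single processor. It is an assumption-laden rendering rather than the thesis code: the name `dynamic_bottom_up` and the list layout (index 0 unused) are hypothetical, and processors are run one after another instead of synchronously.

```python
def dynamic_bottom_up(n, k, d):
    """Sketch of D_BU for one processor that has just visited leaf k (1-based).

    n is a power of two and d is the shared progress heap indexed 1..2n-1.
    """
    i1 = k + (n - 1)                     # heap leaf for array element k
    d[i1] = 1                            # done for good
    for _ in range(n.bit_length() - 1):  # log2(n) levels
        t = i1 // 2                      # parent of i1 and its sibling i2
        i2 = i1 + 1 if 2 * t == i1 else i1 - 1
        d[t] = d[i1] + d[i2]             # update progress at the parent
        i1 = t                           # advance towards the root
    return d[1]                          # current count of visited leaves


n = 8
d = [0] * (2 * n)
for leaf in (1, 2, 5):
    dynamic_bottom_up(n, leaf, d)
print(d[1])
```

Visiting a leaf a second time leaves the heap unchanged, which is the property the main loop relies on when several processors are assigned to the same leaf.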
A.4 Dynamic Top Down Traversal


This  procedure,  given  in  Figure  A.4,  implements  load  rescheduling  of  the  remaining 
active  processors.  Heaps  c  and  d  are  traversed  top  down.  Heap  a  is  used  to  traverse  paths 
to  the  unaccounted  leaves  according  to  the  discipline  implemented  by  this  algorithm. 
Heap  c  is  used  to  partition  the  remaining  processors  between  the  left  and  right  tree 
branches,  and  heap  d  contains  the  progress  information  for  the  subtrees  being  traversed. 
Processors  are  allocated  in  proportion  to  the  remaining  work. 


procedure D_TD(integer pn,  -- dynamic processor no.
               integer k)   -- data item
   shared integer array
      c[1..2N-1],    -- processor counts
      d[1..2N-1],    -- progress/done tree
      a[1..2N-1];    -- accounted tree
   private integer
      j, j1, j2,     -- current/left/right indices
      size;          -- no. of leaves in the current subtree

   j := 1;           -- start at the root
   size := N;        -- the whole tree is visible
   a[1] := d[1];     -- no. of all accounted nodes

   -- traverse from root to leaf
   while size ≠ 1 do
      j1 := 2*j; j2 := j1 + 1;   -- left/right indices

      -- compute accounted node values
      if d[j1]+d[j2] = 0 → a[j1] := 0
      [] d[j1]+d[j2] ≠ 0 → a[j1] := a[j]*d[j1] div (d[j1]+d[j2])
      fi
      a[j2] := a[j] - a[j1];

      -- processor alloc. to left/right sub-trees
      c[j1] := c[j]*(size/2-a[j1]) div (size-a[j]);
      c[j2] := c[j] - c[j1];

      -- go left/right based on proc. no.
      if pn ≤ c[j1] → j := j1
      [] pn > c[j1] → j := j2; pn := pn - c[j1]
      fi;

      size := size div 2   -- half of leaves visible
   od ;

   k := j - (N-1)    -- assign processor based on j
end ;


Figure  A.4:  Phase  W2  procedure  —  Dynamic  top  down  traversal 
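The proportional allocation described above can be tried out with the following Python sketch. It follows Figure A.4 step by step, but the function name `dynamic_top_down`, the 1-based list layout, and the sequential (one processor at a time) execution are assumptions made for this illustration; the progress heap is assumed to be consistent, as it is when no processor fails in the middle of a bottom up traversal.

```python
def dynamic_top_down(n, pn, c, d, a):
    """Sketch of D_TD: return the leaf k (1-based) assigned to processor number pn.

    c, d, a are shared heaps indexed 1..2n-1; c[1] must hold the processor count
    and d the progress tree. n is a power of two and d[1] < n is assumed.
    """
    j, size = 1, n                       # start at the root, whole tree visible
    a[1] = d[1]                          # number of accounted leaves overall
    while size != 1:
        j1, j2 = 2 * j, 2 * j + 1        # left/right children
        total = d[j1] + d[j2]
        a[j1] = 0 if total == 0 else a[j] * d[j1] // total
        a[j2] = a[j] - a[j1]
        # split processors in proportion to the unaccounted leaves on each side
        c[j1] = c[j] * (size // 2 - a[j1]) // (size - a[j])
        c[j2] = c[j] - c[j1]
        if pn <= c[j1]:
            j = j1                       # go left
        else:
            j, pn = j2, pn - c[j1]       # go right, renumber within the subtree
        size //= 2                       # half of the leaves remain visible
    return j - (n - 1)


n = 8
d = [0] * (2 * n)
for leaf in (1, 2, 5):                   # leaves already visited
    d[leaf + n - 1] = 1
for t in range(n - 1, 0, -1):            # build a consistent progress heap
    d[t] = d[2 * t] + d[2 * t + 1]
c, a = [0] * (2 * n), [0] * (2 * n)
c[1] = 5                                 # five surviving processors
print([dynamic_top_down(n, pn, c, d, a) for pn in range(1, 6)])
```

With five processors and five unvisited leaves, each processor lands on a distinct unvisited leaf; with ten processors each unvisited leaf receives two, in line with the bound stated in Lemma 3.2.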
A.5 Dynamic Top Down Traversal Lemma 3.2

The  next  lemma  shows  that  the  algorithm  correctly  achieves  the  desired  load  balancing. 

Lemma 3.2 In each loop-iteration $i$ of algorithm W: (1) processors are only allocated to
unaccounted leaves, and (2) no leaf is allocated more than $\lceil P_i/U_i \rceil$ processors.

Proof: The proof outline of the while loop of the algorithm in the style of Dijkstra
[35] is given in Figure A.5. We make use of the heap definitions, and the property that
results from the dynamic bottom up traversal that if $d[j] \le \mathit{size}$, where $\mathit{size}$ is the
number of leaves in the subtree rooted at $j$, then $d[2j], d[2j+1] \le \mathit{size}/2$ (there are no
more visited leaves than there are leaves).

At the beginning of phase W2, $P_i$ is the value of $c[1]$, and $U_i$ is $N - d[1]$. We define $q$ as
$\lceil P_i/U_i \rceil$. The top down traversal is executed synchronously by all surviving processors,
with the processors writing identical values when writing concurrently. Therefore a
sequential programming calculus for each processor can be used. We prove the following
while loop invariant:

$$I:\ (a[j] < \mathit{size}) \wedge (a[j] \le d[j]) \wedge \big((q-1)(\mathit{size} - a[j]) < c[j] \le q(\mathit{size} - a[j])\big) \wedge (1 \le pn \le c[j])$$

This invariant is established by the assignments to $j$, $\mathit{size}$, and $a[1]$, performed in
the state where $d[1] < N$ by the main loop termination condition, $1 \le pn \le c[1]$ by the
processor enumeration in phase W1, and $(q-1)(N - d[1]) < c[1] \le q(N - d[1])$ by the
properties of ceiling.

We  also  make  use  of  the  following  abbreviation: 

$$Q(j, s) \equiv (q-1)(s - a[j]) < c[j] \le q(s - a[j]).$$

It is straightforward to verify that after $I$ is established, it is left invariant by the
while loop. Some inference detail is omitted in the proof outline (Figure A.5) for
readability. When a processor completes dynamic top-down traversal, the invariant $I$
holds, and $\mathit{size} = 1$. The last assertion of the proof outline implies the following result:


$$(a[j] = 0) \wedge (q - 1 < c[j] \le q) \wedge (1 \le pn \le c[j])$$
{ d[1] < N                              -- by the main loop termination condition
  ∧ 1 ≤ pn ≤ c[1]                       -- by the processor enumeration of phase 1
  ∧ (q-1)(N-d[1]) < c[1] ≤ q(N-d[1])    -- by the properties of ceiling
}
j, size, a[1] := 1, N, d[1];
{ I }                                   -- invariant is initially established
while size ≠ 1 do
   j1 := 2*j; j2 := j1 + 1;
   if d[j1]+d[j2] = 0 → a[j1] := 0
   [] d[j1]+d[j2] ≠ 0 → a[j1] := a[j]*d[j1] div (d[j1]+d[j2])
   fi
   { I ∧ (a[j1] ≤ d[j1] ∧ a[j1] ≤ size/2) }
   a[j2] := a[j] - a[j1];
   { I1 : I ∧ (a[j1] ≤ d[j1] ∧ a[j1] ≤ size/2)
            ∧ (a[j2] ≤ d[j2] ∧ a[j2] ≤ size/2) ∧ (a[j] = a[j1] + a[j2]) }
   c[j1] := c[j]*(size/2-a[j1]) div (size-a[j]);
   { I2 : I1 ∧ Q(j1, size/2) }
   c[j2] := c[j] - c[j1];
   { I3 : I1 ∧ Q(j1, size/2) ∧ Q(j2, size/2) ∧ c[j] = c[j1] + c[j2] }
   if pn ≤ c[j1] → { I3 ∧ pn ≤ c[j1] } { I3 ∧ a[j1] ≠ size/2 } j := j1
   [] pn > c[j1] → { I3 ∧ pn > c[j1] } { I3 ∧ a[j2] ≠ size/2 } j := j2; pn := pn - c[j1]
   fi;
   { I with size/2 replacing size }
   size := size div 2;
   { I }                                -- invariant is preserved by each iteration
od ;
{ I ∧ size = 1 } { a[j] < 1 ∧ Q(j, 1) ∧ 1 ≤ pn ≤ c[j] }


Figure A.5: Proof outline of the phase W2 top down traversal


The safety property that each processor correctly constructs branches of the accounted
tree using heap $a[1..2N-1]$ satisfying the constraints given in Section 3.2.1
follows directly from the proof outline by $a[1] = d[1]$ and for an interior node $j$,
$a[j] = a[2j] + a[2j+1]$ with $a[j] \le d[j]$ for all nodes $j$ (the values of the $a$ heap
are computed once and not changed by the top down traversal). The $O(\log N)$ time
termination for all surviving processors follows from the initialization and assignments
to $\mathit{size}$, and the while guard.


Thus, if a processor completes phase W2 then it would reach a leaf with $a[j] = 0$,
that is, an unaccounted leaf, and at most $q = \lceil P_i/U_i \rceil$ processors would reach the same
leaf. □
A.6 Martel’s Improved Lemma 3.4

Lemma 3.4bis. For any failure pattern with at least one surviving processor, algorithm
W completes all remaining work. Its total number of block-steps $B$ is less than or equal
to $U + P \log U/\log\log U$, where $P$ is the initial number of processors and $U$ is the initial
number of unvisited elements.

Proof: Consider the $i$th iteration of the main loop of algorithm W, as we did in the
analysis of the algorithm in Section 3.2.1.

At the beginning of the iteration, $P_i$ is the overestimate of active processors, and $U_i$
is the estimated remaining unvisited leaves. At the end of the iteration (i.e., at the
beginning of the $(i+1)$-st iteration), the corresponding values are $P_{i+1}$ and $U_{i+1}$. From the
analysis of algorithm W we know that $P_i \ge P_{i+1}$ and $U_i > U_{i+1}$. Let also $P_1 = P$ and
$U_1 = U$.

If a processor begins, but does not complete an iteration of the loop, we are going
to (over)charge the processor $\log U_i$ steps. Such charges, call them $B_0$, will amount to
no more than $B_0 = O(P)$ block-steps (and thus no more than $O(P \log U)$ overall work
steps) for a particular execution of the algorithm. Having taken care of this accounting,
we are only going to account for the iterations that were completed by the participating
processors in the following discussion.

We will treat the three $\log U$ time tree traversals performed by a single processor
during each phase of the algorithm as a single block-step of cost $O(\log U)$. We will charge
each processor for each such completed block step.

Let $\tau$ be the final iteration of the algorithm, i.e., $U_\tau > 0$ and the number of unvisited
elements after the iteration $\tau$ is $U_{\tau+1} = 0$. We examine the following two major cases:

1. Consider all block steps in which $P_i \le U_i$:

By the balanced processor allocation of algorithm W, each leaf will be assigned
no more than 1 processor, therefore the number of block steps $B_1$ accounted in
this case will be no more than $B_1 \le \sum_{i=1}^{\tau} (U_i - U_{i+1}) = U_1 - U_{\tau+1} = U - 0 = U$.

2. Now consider all block steps in which $P_i > U_i$, with the following two subcases:

(2.a) Consider all block steps after which $U_{i+1} < \frac{U_i}{\log U/\log\log U}$.

This could occur no more than $O(\log U/\log\log U)$ times since $U_{i+1} < U_1 = U$. No more
than $P$ processors complete such block steps, therefore the total number of blocks
$B_{2.a}$ accounted by this sub-case is bounded by: $B_{2.a} = O(P \log U/\log\log U)$.

(2.b) Finally consider all block steps such that $P_i > U_i$ and $U_{i+1} \ge \frac{U_i}{\log U/\log\log U}$.

Consider a particular iteration $i$. By Lemma 3.2, at most $\lceil P_i/U_i \rceil$ but no less than
$\lfloor P_i/U_i \rfloor$ processors were assigned to each of the $U_i$ unvisited leaves. Therefore, the
number of failed processors is at least

$$U_{i+1} \left\lfloor \frac{P_i}{U_i} \right\rfloor \ge \frac{U_i}{\log U/\log\log U} \cdot \frac{P_i}{2U_i} = \frac{P_i}{2 \log U/\log\log U},$$

where the first inequality uses $\lfloor x \rfloor \ge x/2$ for $x \ge 1$.

This can happen no more than $\tau$ times. The number of processors completing step
$i$ is no more than $P_i \left(1 - \frac{1}{2 \log U/\log\log U}\right)$. In general, the number of processors completing
the $j$-th occurrence of case (2.b) will be no more than $P \left(1 - \frac{1}{2 \log U/\log\log U}\right)^j$, where $P$ is
the initial number of processors.

Therefore the number of blocks $B_{2.b}$ accounted by this sub-case is bounded by:

$$B_{2.b} \le \sum_{j=1}^{\tau} P \left(1 - \frac{1}{2 \log U/\log\log U}\right)^j \le P \sum_{j=1}^{\infty} \left(1 - \frac{1}{2 \log U/\log\log U}\right)^j \le P \cdot \frac{2 \log U}{\log\log U} = O\!\left(P \frac{\log U}{\log\log U}\right).$$


The total number of block steps $B$ of all cases considered is:
$B = B_0 + B_1 + B_{2.a} + B_{2.b} = O\!\left(U + P \frac{\log U}{\log\log U}\right)$. □

Theorem 3.9 Algorithm W is a robust parallel algorithm for the Write-All problem
with $S = O(N \log^2 N/\log\log N)$, where $N$ is the input array size, and the initial
number of processors $P$ is between 1 and $N$.

Proof: When $1 \le P \le U = N$, each block-step is performed in $O(\log N)$ time with one
array element at each leaf. Therefore

$$S = B \cdot O(\log N) = O\!\left(N \log N + P \frac{\log^2 N}{\log\log N}\right) = O\!\left(N \frac{\log^2 N}{\log\log N}\right).$$ □
Appendix B

Algorithm  X  pseudocode 


Here  we  give  detailed  pseudocode  for  algorithm  X  on  the  restartable  fail-stop 
model.  In  the  pseudocode,  the  action,  recovery  end  construct  of  [90]  is  used 
to  denote  the  actions  and  the  recovery  procedures  for  the  processors.  In  the  algorithm 
this  signifies  that  an  action  is  also  its  own  recovery  action,  should  a  processor  fail  at 
any  point  within  the  action  block. 

The notation “$\langle\langle PID \rangle\rangle_{\lfloor \log(k) \rfloor}$” is used to denote the binary true/false value of the
$\lfloor \log(k) \rfloor$-th bit of the $\log(N)$-bit representation of PID, where the most significant bit
is the bit number 0, and the least significant bit is bit number $\log N$.

The action/recovery construct can be implemented by appropriately checkpointing
the  instruction  counter  in  stable  storage  as  the  last  instruction  of  an  action,  and  reading 
the  instruction  counter  upon  a  restart.  This  is  amenable  to  automatic  implementation 
by  a  compiler. 
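The checkpointing idea can be sketched in a few lines of Python. Everything named here (`Crash`, `run_with_restarts`, the `stable` dictionary standing in for stable storage, and the `crash_plan` list that scripts the failures) is a hypothetical illustration and not part of the construct of [90]; the sketch also assumes that every action is safe to re-execute from its beginning.

```python
class Crash(Exception):
    """Simulated fail-stop failure of the processor."""


def run_with_restarts(actions, memory, stable, crash_plan):
    """Run `actions` in order; after a crash, resume from the checkpointed counter.

    `stable` models stable storage that survives crashes; `crash_plan` is a list of
    booleans consumed at each fault point (True means "fail here").
    Returns the number of restarts.
    """
    def fault_point():
        if crash_plan and crash_plan.pop(0):
            raise Crash()

    restarts = 0
    while True:
        try:
            ic = stable.get("ic", 0)     # read the instruction counter on (re)start
            while ic < len(actions):
                actions[ic](memory, fault_point)
                ic += 1
                stable["ic"] = ic        # checkpoint as the last step of the action
            return restarts
        except Crash:
            restarts += 1                # private state (ic) is lost; stable is not


def set_position(memory, fault_point):
    fault_point()
    memory["w"] = 1


def mark_done(memory, fault_point):
    memory["x"] = 1
    fault_point()                        # a crash here re-executes the whole action
    memory["d"] = 1


memory, stable = {}, {}
print(run_with_restarts([set_position, mark_done], memory, stable, [True, False, True]))
print(memory, stable)
```

Because the counter is written only after an action finishes, a failure inside an action causes exactly that action to be repeated, which is the sense in which an action serves as its own recovery.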

It is possible to perform local optimization of the algorithm by: (i) evenly spacing
the $P$ processors $N/P$ leaves apart when $P < N$, and (ii) using the integer values
at the progress tree nodes to represent the known number of descendant leaves visited
by the algorithm. Our worst case analysis does not benefit from these modifications.

The algorithm can be used to solve Write-All “in place” using the array x[] as a tree
of height $\log(N/2)$ with the leaves $x[N/2..N-1]$, and doubling up the processors at the
leaves, and using $x[N]$ as the final element to be initialized and used as the algorithm
termination sentinel. With this modification, array d[] is not needed. The asymptotic
efficiency of the algorithm is not affected.
forall processors PID = 0..P-1 parbegin
   shared x[1..N];         -- shared memory
   shared d[1..2N-1];      -- “done” heap (progress tree)
   shared w[0..P-1];       -- “where” array
   private done, where;    -- current node done/where
   private left, right;    -- left/right child values

   action,recovery
      w[PID] := 1 + PID;   -- the initial positions
   end ;

   action,recovery
      while w[PID] ≠ 0 do                           -- while haven’t exited the tree
         where := w[PID];                           -- current heap location
         done := d[where];                          -- doneness of this subtree
         if done then w[PID] := where div 2;        -- move up one level
         elseif not done ∧ where > N-1 then         -- at a leaf
            if x[where-N] = 0 then x[where-N] := 1;      -- initialize leaf
            elseif x[where-N] = 1 then d[where] := 1;    -- indicate “done”
            fi
         elseif not done ∧ where ≤ N-1 then         -- interior tree node
            left := d[2*where]; right := d[2*where+1];   -- left/right child values
            if left ∧ right then d[where] := 1;               -- both children done
            elseif not left ∧ right then w[PID] := 2*where;   -- go left
            elseif left ∧ not right then w[PID] := 2*where+1; -- go right
            elseif not left ∧ not right then        -- both subtrees are not done
               -- move down according to the PID bit
               if not ⟨⟨PID⟩⟩⌊log(where)⌋ then w[PID] := 2*where;    -- move left
               elseif ⟨⟨PID⟩⟩⌊log(where)⌋ then w[PID] := 2*where+1;  -- move right
               fi
            fi
         fi
      od
   end
parend .

Figure B.1: Algorithm X detailed pseudo-code.
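To see the progress tree traversal at work, the loop body of Figure B.1 can be simulated as follows. This Python sketch rests on several assumptions that are not in the original: the names `x_step` and `run_algorithm_x`, a 0-based list for `x`, `N` a power of two with `P <= N`, and a random scheduler in which a processor that is not scheduled in a tick stands for one that failed and later restarted (its position `w[PID]` lives in shared memory, so nothing else is lost).

```python
import random


def x_step(pid, n, x, d, w):
    """One iteration of the while-loop body of algorithm X for processor pid."""
    log_n = n.bit_length() - 1
    where = w[pid]
    if d[where]:
        w[pid] = where // 2              # subtree done: move up (0 means exit)
    elif where > n - 1:                  # at a leaf
        if x[where - n] == 0:
            x[where - n] = 1             # initialize leaf
        else:
            d[where] = 1                 # indicate "done"
    else:                                # interior tree node
        left, right = d[2 * where], d[2 * where + 1]
        if left and right:
            d[where] = 1                 # both children done
        elif right:                      # only the left subtree is unfinished
            w[pid] = 2 * where
        elif left:                       # only the right subtree is unfinished
            w[pid] = 2 * where + 1
        else:                            # neither done: descend by the PID bit
            depth = where.bit_length() - 1
            bit = (pid >> (log_n - 1 - depth)) & 1
            w[pid] = 2 * where + bit


def run_algorithm_x(n, p, seed=0):
    """Run p processors on n leaves under a random stop/restart schedule."""
    rng = random.Random(seed)
    x = [0] * n                          # leaf `where` maps to x[where - n]
    d = [0] * (2 * n)                    # progress heap, indices 1..2n-1
    w = [1 + pid for pid in range(p)]    # initial positions as in Figure B.1
    steps = 0
    while any(w):
        active = [pid for pid in range(p) if w[pid] != 0]
        # at least one active processor makes progress in every tick
        running = [pid for pid in active if rng.random() < 0.5] or [rng.choice(active)]
        for pid in running:
            x_step(pid, n, x, d, w)
            steps += 1
    return x, d, w, steps


x, d, w, steps = run_algorithm_x(8, 3, seed=1)
print(x, d[1], steps)
```

The number of loop iterations reported in `steps` depends on the schedule, but the final state does not: every leaf is written and the root of the progress tree is marked done.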


Appendix  C 

Mathematical  lemmas  used  for 
lower  bounds 


Lemma C.1 Given a sorted list of $m$ ($m > 1$) nonnegative integers $a_1, a_2, \ldots, a_m$, then
we have for all $j$ ($1 \le j < m$) that

$$\left(1 - \frac{j}{m}\right) \sum_{i=1}^{m} a_i \le \sum_{i=j+1}^{m} a_i .$$

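Proof: We proceed by induction on $m$. The base case is trivial for $j = 1 < 2 = m$. Using
the inductive hypothesis $\left(1 - \frac{j}{k}\right) \sum_{i=1}^{k} a_i \le \sum_{i=j+1}^{k} a_i$ we show that
$\left(1 - \frac{j}{k+1}\right) \sum_{i=1}^{k+1} a_i \le \sum_{i=j+1}^{k+1} a_i$ by extending the sorted list of $k$ elements by the new element $a_{k+1}$ in the
following straightforward transformations:

$a_i \le a_{k+1}$ (for any $i$, $1 \le i \le k$), and so $\sum_{i=1}^{k+1} a_i \le (k+1)\, a_{k+1}$

$\frac{j}{k+1} \sum_{i=1}^{k+1} a_i \le j\, a_{k+1}$ (for each $j$, $1 \le j \le k$)

$-j\, a_{k+1} + \frac{j}{k+1} \sum_{i=1}^{k+1} a_i \le 0$

$k\, a_{k+1} - j\, a_{k+1} + \frac{j}{k+1} \sum_{i=1}^{k+1} a_i \le k\, a_{k+1}$ (by adding $k\, a_{k+1}$ to both sides)

$(k - j)\, a_{k+1} + \frac{j}{k+1} \sum_{i=1}^{k+1} a_i \le k\, a_{k+1}$

$\left(1 - \frac{j}{k}\right) \sum_{i=1}^{k} a_i + \left(1 - \frac{j}{k}\right) a_{k+1} + \frac{j}{k(k+1)} \sum_{i=1}^{k+1} a_i \le a_{k+1} + \sum_{i=j+1}^{k} a_i$ (by dividing
both sides by positive $k$ and using the inductive hypothesis)

$\left(1 - \frac{j}{k}\right) \sum_{i=1}^{k+1} a_i + j \left(\frac{1}{k} - \frac{1}{k+1}\right) \sum_{i=1}^{k+1} a_i \le \sum_{i=j+1}^{k+1} a_i$ (after grouping terms), finally

$\left(1 - \frac{j}{k+1}\right) \sum_{i=1}^{k+1} a_i \le \sum_{i=j+1}^{k+1} a_i$ (simplifying the inequality). □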
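Lemma C.1 says that the largest $m - j$ elements of a sorted list carry at least a $(1 - j/m)$ fraction of the total. A small Python check of the inequality is sketched below; the function name `lemma_c1_holds` is hypothetical, and both sides are multiplied by $m$ so that the comparison stays in exact integer arithmetic.

```python
import random


def lemma_c1_holds(values):
    """Check (1 - j/m) * sum(a) <= sum(a[j+1..m]) for every j with 1 <= j < m."""
    a = sorted(values)                   # the lemma assumes a sorted list
    m, total = len(a), sum(a)
    # a[j:] is a_{j+1}, ..., a_m in the 1-based notation of the lemma
    return all((m - j) * total <= m * sum(a[j:]) for j in range(1, m))


rng = random.Random(0)
samples = [[rng.randrange(100) for _ in range(rng.randrange(2, 12))] for _ in range(200)]
print(all(lemma_c1_holds(s) for s in samples))
```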




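Lemma C.2 Given $G > 1$, $N > G$, and integer $\sigma$ such that $\sigma \le \frac{\log N}{\log G} - 1$, the
following inequality holds:

$$\underbrace{\lfloor \ldots \lfloor \lfloor N/G \rfloor / G \rfloor \ldots / G \rfloor}_{\sigma \text{ times}} > 0 \qquad (+)$$

(where $\sigma$ is the number of times that the expression in the left hand side of the inequality
contains division by $G$).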

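Proof: To show (+), it suffices to show that, after dropping one floor and strengthening
the inequality:

$$\Big(\underbrace{\lfloor \ldots \lfloor \lfloor N/G \rfloor / G \rfloor \ldots / G \rfloor}_{\sigma - 1 \text{ times}} / G\Big) - 1 > 0, \quad \text{or that} \quad \underbrace{\lfloor \ldots \lfloor \lfloor N/G \rfloor / G \rfloor \ldots / G \rfloor}_{\sigma - 1 \text{ times}} > G.$$

Applying such transformations for $\sigma - 1$ more steps, we get that it suffices to show:
$N > G^{\sigma} + G^{\sigma-1} + \ldots + G$, or $N > \frac{G^{\sigma+1} - G}{G - 1}$ using summation for geometric progressions.

We observe that $G^{\sigma+1} > \frac{G^{\sigma+1} - G}{G - 1}$ (for $G \ge 2$), thus it is enough to show that $N \ge G^{\sigma+1}$. After
taking logarithms of both sides we get $\log N \ge (\sigma + 1) \log G$, and so it suffices to have
$\sigma \le \frac{\log N}{\log G} - 1$. □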

To achieve the results in Section 4, this lemma is used with $G = \log N$, i.e., $\sigma \le \frac{\log N}{\log\log N} - 1$.
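The statement of Lemma C.2 can be checked numerically for the case used in the text, $G = \log N$. The Python sketch below is only an illustration: the helper names `iterated_floor_division` and `max_sigma` are hypothetical, logarithms are taken base 2 so that $N = 2^k$ gives $G = k$, and `max_sigma` relies on floating-point logarithms.

```python
import math


def iterated_floor_division(n, g, sigma):
    """Apply x -> floor(x / g) to n exactly sigma times."""
    x = n
    for _ in range(sigma):
        x = math.floor(x / g)
    return x


def max_sigma(n, g):
    """Largest integer sigma with sigma <= log(n)/log(g) - 1 (floating point)."""
    return math.floor(math.log(n) / math.log(g) - 1)


for k in (4, 10, 16, 32):
    n = 2 ** k                           # so that log2(n) = k plays the role of G
    sigma = max(0, max_sigma(n, k))
    print(k, sigma, iterated_floor_division(n, k, sigma))
```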


Lemma C.3 For $N \to \infty$:

$$\left(1 - \frac{1}{\log N}\right)^{\frac{\log N}{\log\log N}} = 1 - \frac{1}{\log\log N} + O\!\left(\frac{1}{(\log\log N)^2}\right)$$
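A quick numeric look at Lemma C.3 is sketched below in Python. It is a sanity check under assumptions, not a proof: the function name `lemma_c3_gap` is hypothetical, logarithms are taken base 2, and the argument is $\log\log N$ itself so that $\log N = 2^{\log\log N}$ stays within floating-point range.

```python
import math


def lemma_c3_gap(loglog_n):
    """Return A - (1 - 1/loglog_n), where A = (1 - 1/L)^(L/loglog_n), L = 2**loglog_n."""
    big_l = 2.0 ** loglog_n              # L plays the role of log N (base 2 assumed)
    a = math.exp(big_l / loglog_n * math.log1p(-1.0 / big_l))
    return a - (1.0 - 1.0 / loglog_n)


for ll in (2, 5, 10, 20, 40):
    gap = lemma_c3_gap(ll)
    print(ll, gap, gap * ll * ll)        # the last column stays bounded
```

The scaled gap in the last column settles near one half as $\log\log N$ grows, which is consistent with an error term of order $1/(\log\log N)^2$.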


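Proof: The proof is done using standard techniques for manipulating asymptotics (e.g.,
Graham et al. [46]).

Let $A = \left(1 - \frac{1}{\log N}\right)^{\frac{\log N}{\log\log N}}$ and $B = \ln\left(1 - \frac{1}{\log N}\right)$, where $\ln$ is the natural logarithm.
Using the $\ln(1 - z) = -z - \frac{z^2}{2} - \frac{z^3}{3} - \cdots$ expansion around 0, we get

$$B = -\frac{1}{\log N} - \frac{1}{2 \log^2 N} - \frac{1}{3 \log^3 N} - \cdots \quad \text{for } N \to \infty.$$

From this we get:

$$\ln A = \frac{\log N}{\log\log N}\, B = -\frac{1}{\log\log N} - \frac{1}{2 \log N \log\log N} - \frac{1}{3 \log^2 N \log\log N} - \cdots$$

Using the $\exp(z) = 1 + z + \frac{z^2}{2!} + \frac{z^3}{3!} + \cdots$ expansion around 0 and after multiplying the
resulting series and gathering the terms, we get:

$$A = 1 - \frac{1}{\log\log N} + \frac{1}{2 (\log\log N)^2} - \cdots$$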

